Eventing
This section contains the following topics:
Event Performance Guidelines
Performance Management Events
Baseline Averages
How to Monitor Device Performance Using Events
Monitoring Metrics with Event Rules
View Events
How to Configure Notifications from Event Manager
Event Performance Guidelines
The following configuration was used to validate and benchmark event performance:
- A system in full conformance with recommended specifications for a “medium” production system of 500K polled items (referring to system sizing specifications).
- 10 event rules, spread over 7 monitoring profiles that are being used on polled items.
- There was 1 event rule being evaluated at the 1-minute rate on a metric family comprising ~33 percent of our polled items.
- There was 1 event rule being evaluated at the 15-minute rate on a metric family comprising ~33 percent of our polled items.
- The remaining rules were applied to a portion of the remaining items being polled at 5 minutes.
- The event rules were spread out evenly over 4 metric families.
- Each rule had 1 fixed condition and 1 standard deviation condition.
- 6 event rules had a duration of 5 minutes and window of 15 minutes.
- 4 event rules had a duration of 15 minutes and window of 60 minutes.
Note: For optimal performance, minimize the number of monitoring profiles that have event rules for the same metric family. For example, one monitoring profile with ten rules for the Interfaces metric family will perform better than ten monitoring profiles with one rule for Interfaces metric family, when applied to the same set of devices.
- 100K polled items had a varying number of event rules that were associated to them.
- There were 5 Data Collector systems, each polling approximately 1/5th of the items.
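Given a rule's poll rate and the number of items it covers, you can roughly estimate the evaluation load that a configuration like the one above generates in each reporting window. The following Python sketch is illustrative only; the function name and input shape are our own, not part of the product:

```python
def evaluations_per_window(rules, window_s=300):
    """Estimate total event rule evaluations in one reporting window.

    rules: list of (items_covered, poll_rate_seconds, rule_count) tuples,
           one per group of rules sharing a poll rate and item set.
    window_s: length of the reporting window in seconds (default 5 minutes).
    """
    return sum(items * count * (window_s // rate)
               for items, rate, count in rules)

# Hypothetical example: 2 rules on 10K items polled at 5 minutes,
# plus 1 rule on 5K items polled at 1 minute.
load = evaluations_per_window([(10_000, 300, 2), (5_000, 60, 1)])
```

Comparing such an estimate against the evaluation counts your system actually reports (see Count of Processed Event Rule Evaluations below) can help you predict the impact of a new rule before you enable it.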
How to Monitor Event Processing
To determine whether you are doing too much eventing, monitor a few key performance indicators in Data Aggregator. Eventing in Data Aggregator is performed in batches (that is, events are evaluated and generated for large groups of items at once). For this reason, a variety of metrics tracked through the Data Aggregator system’s self-monitoring mechanism were used to assess the health of the system. To view these metrics, add a custom IM Device MultiTrend view to a dashboard. Edit the dashboard to use the following metrics from the metric family Data Aggregator Event Calculation Times:
- Event Process Queue Size – Shows the size of the event processing queue. A constant value of zero, one, or two indicates that the system is healthy and keeping up with current eventing. A constant value larger than two indicates that the system can sustain the current eventing load but is potentially behind (processing polls older than the current poll cycle). An increase in queue size without a subsequent recovery (a downward trend) indicates that eventing is backed up and your system may be at risk.
- The following two metrics complement each other.
- Count of Cleared Events – Number of cleared events that are in the reporting resolution window.
- Count of Created Events – Number of raised events that are in the reporting resolution window.
A continuously large number of raised or cleared events can impact the Event Manager database.
If the combined total of these two metrics exceeds 900 events in a 5-minute poll cycle, you have exceeded the 2-3 events per second generation rate recommended for medium systems. Occasional bursts above 900 events in a 5-minute poll cycle are acceptable; a sustained rate above that threshold is not.
- Count of Processed Event Rule Evaluations – An event rule evaluation is the evaluation of a single event rule against a single item. This metric tracks the sum of event rules, multiplied by the number of items those rules are applied to. The higher the number of evaluations, the more work your system is doing. However, not all evaluations are created equal. For example, evaluations with more conditions, more standard deviation conditions, or a longer duration and window are more expensive than evaluations with fewer, fixed conditions and a smaller duration and window. As a result, the number of evaluations your system can sustain depends on the makeup of your event rules.
In our test environment, as described previously, we saw that exceeding 150K evaluations for a 5-minute poll cycle put the system at risk.
- Total Time to Calculate Events – Total amount of time that was spent processing events for this metric family. If this number exceeds the number of seconds in the reporting resolution window, then it is an indication that eventing was delayed or backlogged at that point in time.
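Taken together, the thresholds above can be checked programmatically against readings you export from self-monitoring. This Python sketch is an assumption-laden illustration, not a product API; the function and argument names are ours, and the thresholds are the medium-system figures quoted in this section:

```python
def eventing_health(queue_size, created, cleared, evaluations,
                    calc_time_s, window_s=300):
    """Flag self-monitoring readings that this section identifies as risks.

    Thresholds are the medium-system values from this section; adjust
    them for your own sizing. Returns a list of warning strings
    (empty list means no risk indicators were seen in this window).
    """
    warnings = []
    if queue_size > 2:
        warnings.append("event processing queue backed up (size > 2)")
    if created + cleared > 900:
        warnings.append("event rate above the sustained 900-per-window limit")
    if evaluations > 150_000:
        warnings.append("rule evaluations above the tested 150K-per-window limit")
    if calc_time_s > window_s:
        warnings.append("event calculation took longer than the reporting window")
    return warnings
```

For example, `eventing_health(0, 100, 50, 10_000, 120)` returns an empty list (healthy), while a reading of a growing queue with high event and evaluation counts returns multiple warnings.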
By watching all of these metrics over time, you can judge the health of event performance on your system. Additionally, if the Karaf log on the Data Aggregator system contains database or other errors, this can indicate a system under stress. In general, these self-monitored metrics should be steady. However, during the evening hours (by default between 2 and 4 AM UTC), some database-intensive jobs run and can cause fluctuations in the self-monitored metrics. If the metrics return to a steady state, the system can still be considered in good health (although events can be delayed while the system is busy).
We recommend that you turn on eventing gradually and judge the system health before moving forward with additional rules. We also recommend that you monitor the health of the system for 24 hours after each change, because nightly processing can have an impact even when eventing appears steady throughout the daytime hours.
How to Remediate When the Threshold is Exceeded
To remediate when you exceed the threshold, follow this process:
- Turn off event rules one at a time. Check the performance after you turn off each rule before turning off another rule.
- Reduce the number of items being polled.
- Reduce the number of monitoring profiles with event rules that are polling items.
- If these steps do not improve the performance, contact CA Support.
Copyright © 2015 CA Technologies.
All rights reserved.