How to suppress excessive alerting using Alert periods

Applies to
ApexSQL Monitor

Summary
This article explains how to suppress excessive alerting when monitoring SQL Server in different environments by defining an alert period during which a metric must exceed the threshold in order for an alert to be triggered.

Description

It may be perfectly normal for some monitored metrics to spike or to exceed defined thresholds for short periods of time. In this context, the generated alerts would be both unnecessary and unwanted, that is, false positives. This problem can be even more acute when the metric collection frequency is high or when many SQL Server instances are monitored.

The challenge is to configure ApexSQL Monitor to reduce the number of these false positive alerts. The Alert period feature serves this purpose.

FAQs

Q: What is an Alert period?

The alert period is the minimum amount of time in which all consecutive readings must return values that exceed the configured threshold value to trigger an alert

Q: What is meant by the Metric period?

This is the interval at which the metric is measured

Q: Where is the Alert period configured?

The alert period can be configured in the Metrics window of ApexSQL Monitor.

To access the Metrics window, in the main menu select the Metrics button in the Configure group of the Home tab

Alternatively, the alert period can be defined for each raised alert after accessing the Alerts window

To access the Alerts window, in the main menu select the Alerts button in the Configure group of the Home tab. When a specific alert is selected in the list, the bottom left pane displays the configuration for that metric, where the Alert period value can be changed if needed

Q: What are the different alert period and metric period combinations that can be set?

There are three different alert suppression configurations that the user can set: Alert period = 0, Alert period ≤ Period and Alert period > Period

Q: How does setting Alert periods to 0 affect alert suppression?

When the alert period is set to 0, alert suppression is turned off, meaning that an alert will be triggered every time the metric value exceeds the alert threshold

Q: How does setting the Alert period to a value smaller than the metric period affect alert suppression?

If the alert period is less than the metric period, the second consecutive metric reading where the metric exceeds the threshold will trigger an alert. In this way, the application can specify that the alert will not be triggered before the minimum alert period configured is reached

Here is an example that illustrates this


Example:
Alert period = 15 seconds
Metric period = 30 seconds
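
The arithmetic behind this example can be sketched in Python. The function name first_alert_reading is hypothetical, and the sketch rests on an assumption inferred from the examples in this article rather than from ApexSQL Monitor internals: readings are evenly spaced, and the alert fires on the first consecutive breach that is read at least one alert period after the first breach.

```python
def first_alert_reading(alert_period: int, metric_period: int) -> int:
    """Return the 1-based number of the consecutive breach that raises the alert.

    Assumption (not taken from the product): breach n is read
    (n - 1) * metric_period seconds after the first breach, and the alert
    fires once that elapsed time reaches alert_period.
    """
    if metric_period <= 0 or alert_period < 0:
        raise ValueError("metric_period must be > 0 and alert_period >= 0")
    # Ceiling division without floats: whole metric periods needed
    # to cover the alert period.
    periods_needed = -(-alert_period // metric_period)
    return periods_needed + 1


# Alert period = 15 seconds, Metric period = 30 seconds
print(first_alert_reading(15, 30))
```

Under these assumptions the call returns 2, which matches the second consecutive reading described above.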

Q: How does setting the Alert period to a value larger than the metric period suppress alerts?

If the alert period (in this case 50 seconds) is larger than the metric period (e.g. 15 seconds), every consecutive metric reading since the first threshold breach must exceed the threshold for at least the duration of the alert period. The alerting engine suppresses alerts for the first and all subsequent threshold breaches within the specified time in seconds, and the first consecutive reading that exceeds the threshold after the defined alert period has elapsed triggers the alert


Example:
Alert period = 50 seconds
Metric period = 15 seconds

In this particular case, an alert will be triggered on the fifth consecutive metric reading that exceeds the configured threshold. In this way the application can specify that the alert will not be triggered before the alert period has passed
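
To see why the fifth reading is the one that triggers the alert, the consecutive breaches can be walked through one by one. This is a minimal sketch: breach_timeline is a hypothetical helper, and it assumes that the first breach marks second 0 and that the alert fires once the elapsed time reaches the alert period.

```python
def breach_timeline(alert_period: int, metric_period: int, readings: int):
    """List (reading number, seconds since first breach, fires alert).

    Every reading is assumed to exceed the threshold; timing is idealized.
    """
    timeline = []
    fired = False
    for number in range(1, readings + 1):
        elapsed = (number - 1) * metric_period
        # Only the first reading at or past the alert period raises the alert.
        fires = not fired and elapsed >= alert_period
        fired = fired or fires
        timeline.append((number, elapsed, fires))
    return timeline


# Alert period = 50 seconds, Metric period = 15 seconds
for number, elapsed, fires in breach_timeline(50, 15, 6):
    status = "alert triggered" if fires else "no new alert"
    print(f"reading {number}: {elapsed:>3}s since first breach, {status}")
```

Readings one to four fall at 0, 15, 30 and 45 seconds, all short of the 50 second alert period; the fifth reading, at 60 seconds, is the first one past it.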

Q: What happens if not all consecutive metric readings within the alert period are out of threshold?

If any of the consecutive metric readings are below the configured threshold, the alert will not be triggered. The first consecutive metric reading that is below the threshold will reset the measuring of the alert period; the first metric reading that exceeds the threshold after that will initiate measuring a new alert period

Example:
Alert period = 50 seconds
Metric period = 15 seconds

As shown in the image above, one metric reading was below the threshold within the time defined by the alert period, so the alert was not triggered on the next metric reading. Instead, the alert period measurement was reset, and a new alert period measurement started on the next metric reading whose value exceeded the configured threshold.
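
The reset behavior described above can be sketched as a small function. The name first_alert_index and the sample sequence are hypothetical, and the timing model (evenly spaced readings, alert raised once a run of breaches has lasted at least the alert period) is an assumption based on the examples in this article.

```python
from typing import Optional, Sequence


def first_alert_index(
    breaches: Sequence[bool], alert_period: int, metric_period: int
) -> Optional[int]:
    """Return the 0-based index of the reading that raises the alert, or None."""
    run_start = None  # index of the first breach in the current run
    for index, breached in enumerate(breaches):
        if not breached:
            # A reading below the threshold resets the measurement.
            run_start = None
            continue
        if run_start is None:
            # The first breach after a reset starts a new measurement.
            run_start = index
        if (index - run_start) * metric_period >= alert_period:
            return index
    return None


# Hypothetical sequence: the 4th reading drops below the threshold.
readings = [True, True, True, False, True, True, True, True, True]
print(first_alert_index(readings, alert_period=50, metric_period=15))
```

In this made-up sequence the measurement restarts at the fifth reading (index 4), so the alert is raised at index 8, 60 seconds after the restart, rather than at index 4.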

Q: What happens when an alert is triggered and the metric readings after that return values that exceed the threshold?

Once an alert of a certain severity is triggered, it will not be triggered again, regardless of how many times the metric consecutively exceeds the threshold. The first metric reading that is below any threshold will reset the alert period measurement

An additional situation also affects alert behavior here. The first metric reading that exceeds the high severity threshold will reset the alert period measurement and initiate a new measurement, but only if the next consecutive metric reading also exceeds the high severity threshold

Example:
Alert period = 15 seconds
Metric period = 15 seconds

A medium alert is triggered and the next two consecutive metric readings exceeded the medium severity threshold, followed by consecutive readings that exceeded the high severity threshold

As demonstrated, a medium alert is triggered after the 2nd medium threshold breach, and alerting is then suppressed for consecutive metric readings of the same (medium) threshold severity. When the first metric reading that exceeds the high severity threshold is encountered, the alert period measurement is initiated again, and a new high severity alert is triggered once the alert period is met

In this way, the alerting engine ensures that higher severity alerts are never suppressed in favor of lower severity alerts
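
The suppression and escalation behavior described above can be modeled roughly as follows. This is a sketch under stated assumptions, not the product's implementation: fired_alerts and SEVERITIES are hypothetical names, the alert period is assumed to be greater than 0, and a measurement is assumed to be cancelled when a reading falls back to the severity of the alert that was already raised.

```python
SEVERITIES = ["none", "low", "medium", "high"]  # list index doubles as rank


def fired_alerts(readings, alert_period, metric_period):
    """Return (index, severity) pairs for the alerts raised over the readings."""
    alerts = []
    fired_rank = 0  # rank of the alert already raised in this run of breaches
    start = None    # index where the current measurement began
    lowest = 0      # lowest rank seen since the measurement began
    for index, name in enumerate(readings):
        rank = SEVERITIES.index(name)
        if rank == 0:
            # Below every threshold: forget the raised alert and any measurement.
            fired_rank, start = 0, None
            continue
        if rank <= fired_rank:
            # Same or lower severity than the alert already raised: suppressed.
            start = None
            continue
        if start is None:
            start, lowest = index, rank
        else:
            lowest = min(lowest, rank)
        if (index - start) * metric_period >= alert_period:
            alerts.append((index, SEVERITIES[lowest]))
            fired_rank, start = lowest, None
    return alerts


# Alert period = 15 seconds, Metric period = 15 seconds
readings = ["medium", "medium", "medium", "medium", "high", "high"]
print(fired_alerts(readings, alert_period=15, metric_period=15))
```

With this sequence the sketch raises a medium alert on the second reading, suppresses the next two medium readings, and raises a high alert on the second consecutive high reading.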

Q: What alert will be triggered if the threshold is violated by all consecutive metric readings within the alert period, but the readings do not all breach the same threshold severity?

When all consecutive metric readings return a value that exceeds any one of the low, medium or high thresholds, the triggered alert will be of the lowest severity that has been read. So if low, medium and high threshold violations are encountered within the alert period, the resulting alert will be of low severity. Even if all metric readings within the alert period are high except one that is of low severity, the resulting alert will be of low severity

Example:
Alert period = 50 seconds
Metric period = 15 seconds
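
The lowest-severity rule above can be expressed in a few lines. This is a minimal sketch: resulting_severity and RANK are hypothetical names, the five sample readings are made up for illustration, and None stands for a reading that is below every threshold.

```python
from typing import Optional, Sequence

RANK = {"low": 1, "medium": 2, "high": 3}


def resulting_severity(window: Sequence[Optional[str]]) -> Optional[str]:
    """Severity of the alert for the readings in one alert period, or None."""
    if not window or any(name is None for name in window):
        # A reading below every threshold means no alert is raised.
        return None
    # The alert takes the lowest severity read within the alert period.
    return min(window, key=lambda name: RANK[name])


# Five consecutive readings covering a 50 second alert period (15 second metric period)
print(resulting_severity(["high", "high", "low", "high", "high"]))  # low
```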

Q: How can the alert engine ensure that consecutive high threshold breaches within the alert period trigger a high severity alert, even when the readings before or after are of mixed or lower severities?

To prevent High severity alerts from being suppressed in favor of lower severity alerts, the alerting engine can recognize when the last reading, or the last several consecutive readings, within the measured alert period exceeded the high severity threshold. It accounts for this by starting an additional alert period measurement

Example:
Alert period = 50 seconds
Metric period = 15 seconds

As can be seen, the alerting engine fired a Low severity alert on the 5th reading, but since the 5th reading and the one before it both exceeded the high threshold, a new alert period measurement is set, starting from the 4th reading (the first of those high threshold breaches). In this way, subsequent high severity threshold violations will not be suppressed by lower severity alerts
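
One way to picture where the additional measurement starts is to locate the first reading in the trailing run of high readings. This is only a sketch: high_restart_index is a hypothetical helper, and the severities of the first three readings in the sample window are assumed for illustration.

```python
def high_restart_index(severities):
    """Return the 0-based index of the first reading in the trailing "high" run.

    Returns None when the last reading is not "high".
    """
    restart = None
    # Walk backwards from the last reading while readings stay "high".
    for index in range(len(severities) - 1, -1, -1):
        if severities[index] != "high":
            break
        restart = index
    return restart


# Readings 1 to 5; only the last two are high, as in the example above.
window = ["low", "medium", "medium", "high", "high"]
restart = high_restart_index(window)
print(f"new measurement starts at reading {restart + 1}")
```

For this window the helper returns index 3, that is, the 4th reading, which is where the new alert period measurement begins in the example.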