Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: label mismatch for alertmanager_notifications_failed_total #3599

Merged
merged 1 commit into from
Nov 14, 2023

Conversation

cr7258
Copy link
Contributor

@cr7258 cr7258 commented Nov 10, 2023

Since Alertmanager 0.26.0 version, a new label reason was added in the alertmanager_notifications_failed_total metric to indicate the type of error of the alert delivery.

As a result, the original alert rules are broken because labels are mismatched between alertmanager_notifications_failed_total and alertmanager_notifications_total metrics.
Prometheus requires samples with exactly the same labels to get matched together when performing calculations. docs

Use the ignoring vector matching keyword to ignore the new reason label to allow for matching between series with different labels.

Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing it!

@@ -44,7 +44,7 @@
(
rate(alertmanager_notifications_failed_total{%(alertmanagerSelector)s}[5m])
/
rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])
on (integration) group_left rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to ignore the reason label since it's the only one that's different between the 2 metrics. It will change a bit the alert condition in case of multiple reasons (e.g. the sum for all reasons might be above the threshold while the individual reasons are below). But the current threshold value is low enough that it shouldn't be an issue in practice.

Suggested change
on (integration) group_left rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])
ignoring (reason) group_left rate(alertmanager_notifications_total{%(alertmanagerSelector)s}[5m])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion. Fixed.

Copy link
Member

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@simonpasquier simonpasquier merged commit 716830a into prometheus:main Nov 14, 2023
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants