
Feature request: slack_configs.send_resolved a templatable field #4033

Open
dwilliams782 opened this issue Sep 19, 2024 · 11 comments

Comments

@dwilliams782

What did you do?

I'm trying to set up a receiver that can take a dynamic value for send_resolved, such as:

    ## Route alerts based on slack_channel label
    - name: global_slack_channel_label
      slack_configs:
        - api_url: <some-url>
          channel: '{{ printf "#%s" (index ((index .Alerts 0).Labels) "slack_channel") }}'
          send_resolved: {{ (index ((index .Alerts 0).Labels) "send_resolved") | quote | default true | replace "\"" "" }}

With this, we get "line 894: cannot unmarshal !!map into bool", as Alertmanager requires this field to be a boolean rather than a tmpl_string. There are various approaches to how this could be done, so I'm not worried about whether this particular template works or not; the point is that the send_resolved field does not allow templating as-is.

What did you expect to see?

I would like to be able to render a true/false boolean based on a label in an alert. This would reduce the number of receivers required.

@grobinson-grafana
Contributor

At least for me, it is hard to understand the use case for this feature. Making notification decisions based on individual alerts also doesn't fit well with the Alertmanager design, which is based around grouping related alerts together. It would also be very easy for someone to write an invalid template and accidentally break all their notifications without realizing it.

@dwilliams782
Author

dwilliams782 commented Oct 3, 2024

We have a centralised Alertmanager with a configuration containing routes and receivers for a dozen or more teams, and hundreds of different alerts flowing to different channels.

For some alerts, teams will want send_resolved notifications; for others, they won't. The only way to do this currently is to have two routes and two receivers, one with a send_resolved: false matcher that routes to the receiver with send_resolved set to false. Multiply this by the number of teams and the different permutations of send_resolved true/false, and you get a messy configuration.
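For illustration, here's roughly what that duplication looks like for a single team (the receiver names, channel, and matchers here are made up, not our real configuration):

    route:
      routes:
        - receiver: slack_devops_no_resolved
          match:
            team: devops
            send_resolved: "false"
        - receiver: slack_devops
          match:
            team: devops
    receivers:
      - name: slack_devops
        slack_configs:
          - api_url: <some-url>
            channel: '#devops'
            send_resolved: true
      - name: slack_devops_no_resolved
        slack_configs:
          - api_url: <some-url>
            channel: '#devops'
            send_resolved: false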

It would massively simplify (halve!) our receivers configuration if we could inherit send_resolved (and other route-level attributes like group_by) from the alert labels.

@grobinson-grafana
Contributor

Isn't this just "moving" the messiness from Alertmanager to Prometheus? Instead of a receiver per team, every alert rule must carry a custom send_resolved label?

@dwilliams782
Author

dwilliams782 commented Oct 3, 2024

It would be an optional label with send_resolved: true default in the receiver, with alerts optionally able to turn off resolved notifications, yes.

That per-alert complexity makes far more sense to me than complexity at the receiver level; a lot of alerts just generate noise with resolved notifications, but for some they are genuinely useful.
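To make that concrete, under this proposal an alert would only need one optional label to opt out (a hypothetical rule; the send_resolved label is the proposed convention, not an existing feature):

    - alert: NoisyBatchJobFailed
      expr: kube_job_status_failed > 0
      labels:
        team: devops
        send_resolved: "false"  # opt out of resolved notifications for this alert only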

@dwilliams782
Author

dwilliams782 commented Oct 3, 2024

A practical example of how we use dynamic configuration, taking a generic Kubernetes alert:

        - alert: CPUThrottlingHigh
          annotations:
            description: "{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}."
            summary: Processes experience elevated CPU throttling.
            runbook: <link to our runbook>
          expr: |
            sum by (container, pod, namespace, team) (increase(container_cpu_cfs_throttled_periods_total[5m]))
            /
            sum by (container, pod, namespace, team) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100)
          for: 15m
          labels:
            group_by: pod

We want this CPUThrottlingHigh alert to be routed to the correct Slack channel based on the team label on the pod, but we want this particular alert to be grouped by pod, so we get individual alerts per pod instead of grouping them and risking missed notifications (or not being alerted quickly enough due to group_interval). We set a group_by: pod label on the alert, and then we have to have a route per team like:

- receiver: slack_devops
  group_by: [alertname, pod]
  match_re:
    team: devops|platform-infrastructure|cloud-infrastructure
    group_by: pod

Being able to set the group_by attribute from the alert labels would reduce the number of routes we require, and it's the same idea with making send_resolved templatable. It makes perfect sense to me for each alert to be able to specify what it should be grouped by, its repeat_interval, group_wait, send_resolved, and all those other route/receiver-level attributes.
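Purely as a sketch of what I mean (this syntax is hypothetical and not supported today), the route above could collapse to something like:

    - receiver: slack_devops
      group_by: ['{{ .CommonLabels.group_by }}']  # hypothetical templated attribute
      match_re:
        team: devops|platform-infrastructure|cloud-infrastructure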

@grobinson-grafana
Contributor

It would be an optional label with send_resolved: true default in the receiver, with alerts optionally able to turn off resolved notifications, yes.

I still think there are a lot of disadvantages here, and not a lot of advantages. For example:

  1. Having support for templates in send_resolved makes diagnosing issues like "why didn't I get a resolved notification?" much more difficult, because the behaviour is no longer static and instead depends on the alerts present at the time and their labels.

  2. There is also the issue that, as far as I can tell, alerts in Alertmanager are unsorted. That means that if a group has two alerts, one with send_resolved: true and the other with send_resolved: false, then a template that reads alerts by index would be non-deterministic:

    send_resolved: {{ (index ((index .Alerts 0).Labels) "send_resolved") | quote | default true | replace "\"" "" }}
    

    This would also cause problems in high-availability mode, where one Alertmanager could be sending resolved notifications while the others are not.

  3. In your example, one team could accidentally disable resolved notifications for all other teams just by having a send_resolved: false label in one of their alerts:

    - receiver: slack_devops
      group_by: [alertname, pod]
      match_re:
        team: devops|platform-infrastructure|cloud-infrastructure
        group_by: pod
    

Are you sure this couldn't be solved a better way, for example, by provisioning your Alertmanager configuration file using automation or CI?

@dwilliams782
Author

Are you sure this couldn't be solved a better way, for example, by provisioning your Alertmanager configuration file using automation or CI?

I'm open to suggestions. What would you recommend for controlling which alerts are resolved and which aren't?

@grobinson-grafana
Contributor

If I understand the problem, you are concerned with having a large number of routes and receivers for the teams in your organization, and find them difficult to manage due to their sheer number.

My suggestion would be to provision your Alertmanager configuration automatically, so you never need to edit it by hand. For example, each team could specify their own route and receiver configurations, and you could merge them into a single Alertmanager configuration file using a script or automated job. You can avoid conflicts by suffixing the team name to each receiver name, like slack_devops and slack_platform_infrastructure, and then using these in the routes.

If you just want individual teams to choose their send_resolved behaviour without configuring their own routes and receivers, then just give them that option and, for each team, automatically create a receiver and route in the configuration file with their send_resolved preference.
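As a sketch, the input each team provides could be as small as this (the file layout and field names here are hypothetical), with a CI job rendering the full Alertmanager configuration from it:

    # Hypothetical per-team preferences consumed by a config-generation job
    teams:
      - name: devops
        channel: '#devops'
        send_resolved: true
      - name: platform-infrastructure
        channel: '#platform-infra'
        send_resolved: false

The job would then emit one route and receiver per team, e.g. slack_platform_infrastructure with send_resolved: false.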

@dwilliams782
Author

It isn't as black and white as teams wanting or not wanting resolved notifications, though. Whether a resolved notification is useful is still very much specific to the context of each alert.

Even with automation generating the routes and receivers, the initial problem doesn't change.

@grobinson-grafana
Contributor

each alert as to whether a resolved notification is required

I can see you are thinking about it in terms of individual alerts, but Alertmanager is very much a tool for grouping alerts. Can you rephrase your problem in terms of groups? How should a send_resolved decision be made for a group of alerts in your organization?

@mikkergimenez

mikkergimenez commented Oct 18, 2024

I have the same use case as dwilliams. Based on the nature of the alert, some of my alerts shouldn't have send_resolved set, and those alerts may well belong to a group of alerts that do need send_resolved. It's not that "all alerts from this service, or to this team, don't need send_resolved"; it's more like "log alerts for any service written in this way don't need send_resolved". In fact, I feel this goes against the design goal of thinking in terms of groups of alerts, because I'm forced to ungroup along business-logic lines and regroup along technical-implementation lines.
