Prometheus Alerting Rules Collection

Welcome to the Prometheus Alerting Rules Collection.

Join us

Do you want to help us, join us, or learn more? Everything happens here: https://doc.aucoeurdu.cloud/contribuer/

Node Exporter

Alerting rules for nodes (Node Exporter).

NodeFilesystemAlmostOutOfSpace

Alert when there is less than 10% free disk space.

- alert: NodeFilesystemAlmostOutOfSpace
  expr: node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""} * 100 < 10
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Filesystem has less than 10% space left"
    description: "Filesystem {{ $labels.device }} mounted on {{ $labels.mountpoint }} has only {{ $value }}% space left."

Blackbox Exporter

Alerting rules for Blackbox Exporter.

BlackboxProbeHttpFailure

Alert when the HTTP status code is not 200-399.

- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
    description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

BlackboxSlowProbe

Alert when the average probe duration over 1 minute is greater than 1 second.

- alert: BlackboxSlowProbe
  expr: avg_over_time(probe_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox slow probe (instance {{ $labels.instance }})
    description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

BlackboxSslCertificateWillExpireSoon (Warning)

Alert when the SSL certificate expires in less than 30 days.

- alert: BlackboxSslCertificateWillExpireSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

BlackboxSslCertificateWillExpireSoon (Critical)

Alert when the SSL certificate expires in less than 3 days.

- alert: BlackboxSslCertificateWillExpireSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in 3 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

BlackboxSslCertificateExpired

Alert when the SSL certificate has expired.

- alert: BlackboxSslCertificateExpired
  expr: probe_ssl_earliest_cert_expiry - time() <= 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
    description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Rudder Exporter

Alerting rules for Rudder (using prometheus-rudder-exporter).

RudderDown

Alert when the Rudder API is down.

- alert: RudderDown
  expr: rudder_up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Rudder API is down"
    description: "The Rudder API is not reachable."

RudderGlobalComplianceLow

Alert when the global compliance falls below 80%.

- alert: RudderGlobalComplianceLow
  expr: rudder_global_compliance < 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Rudder global compliance is low"
    description: "The global compliance of the infrastructure is at {{ $value }}% (below 80%)."

RudderNodeComplianceLow

Alert when a specific node compliance falls below 80%.

- alert: RudderNodeComplianceLow
  expr: rudder_node_compliance < 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Rudder node compliance is low"
    description: "The compliance of node {{ $labels.node_hostname }} is at {{ $value }}% (below 80%)."

HAProxy

Alerting rules for HAProxy (using the built-in Prometheus exporter).

HaproxyBackendMaxActiveSession

Alert when the maximum number of active sessions on a backend server exceeds 80% of its configured session limit.

- alert: HaproxyBackendMaxActiveSession
  expr: ((haproxy_server_max_sessions > 0) * 100) / (haproxy_server_limit_sessions > 0) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "HAProxy backend max active session > 80% (instance {{ $labels.instance }})"
    description: "Session limit from backend {{ $labels.proxy }} to server {{ $labels.server }} reached 80% of limit - {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

HaproxyHasNoAliveBackends

Alert when an HAProxy backend has no alive servers (active or backup).

- alert: HaproxyHasNoAliveBackends
  expr: haproxy_backend_active_servers + haproxy_backend_backup_servers == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "HAproxy has no alive backends (instance {{ $labels.instance }})"
    description: "HAProxy has no alive active or backup backends for {{ $labels.proxy }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

OpenBao Exporter

Alerting rules for OpenBao (compatible with Vault metrics).

VaultSealed

Alert when the Vault/OpenBao instance is sealed.

- alert: VaultSealed
  expr: vault_core_unsealed == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Vault sealed (instance {{ $labels.instance }})"
    description: "Vault instance is sealed on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

VaultTooManyPendingTokens

Alert when there is a sustained backlog of pending tokens (on average, more tokens created than stored).

- alert: VaultTooManyPendingTokens
  expr: avg(vault_token_create_count - vault_token_store_count) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Vault too many pending tokens (instance {{ $labels.instance }})"
    description: "Too many pending tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

VaultTooManyInfinityTokens

Alert when there are more than 3 tokens with an infinite TTL.

- alert: VaultTooManyInfinityTokens
  expr: vault_token_count_by_ttl{creation_ttl="+Inf"} > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Vault too many infinity tokens (instance {{ $labels.instance }})"
    description: "Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

VaultClusterHealth

Alert when the cluster health is degraded (half or fewer of the nodes are active).

- alert: VaultClusterHealth
  expr: sum(vault_core_active) / count(vault_core_active) <= 0.5
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Vault cluster health (instance {{ $labels.instance }})"
    description: "Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Unbound Exporter

Alerting rules for Unbound DNS resolver (using unbound_exporter).

UnboundResolverDown

Alert when a specific Unbound resolver node is down.

- alert: UnboundResolverDown
  expr: unbound_up == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "DNS resolving stack: one or more nodes down"
    description: "Unbound on {{ $labels.instance }} is not healthy"

UnboundResolverStackDown

Alert when the entire DNS resolving stack is down (no Unbound instance is reporting metrics at all).

- alert: UnboundResolverStackDown
  expr: absent(unbound_up)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "DNS resolving stack is down"
    description: "No more nodes in DNS resolving stack"

UnboundResolverStackResponse

Alert when the DNS resolution response time is high (requires Blackbox Exporter).

- alert: UnboundResolverStackResponse
  expr: quantile without(instance) (0.5, probe_duration_seconds{module="dns_udp"}) > 0.2
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "DNS resolving stack response time is high"
    description: "{{ $value | humanizeDuration }} resolving response time from {{ $labels.instance }}"

UnboundResolverStackServFail

Alert when the SERVFAIL rate is high (> 50%).

- alert: UnboundResolverStackServFail
  expr: sum without(rcode) (unbound_answer_rcodes_total{rcode="SERVFAIL"}) / sum without(thread) (unbound_queries_total) > 0.5
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "DNS resolving stack is showing a high SERVFAIL rate"
    description: "{{ $labels.instance }} is showing a SERVFAIL rate of {{ $value | humanizePercentage }} in the last 2 minutes."

Junos Exporter

Alerting rules for Juniper devices using junos_exporter.

Hardware Alarms

JunosRedAlarm

Alert when there is a red alarm on the device.

- alert: JunosRedAlarm
  expr: junos_alarms_red_count > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Red alarm on {{ $labels.instance }}"
    description: "Device {{ $labels.instance }} is reporting {{ $value }} red alarms."

JunosYellowAlarm

Alert when there is a yellow alarm on the device.

- alert: JunosYellowAlarm
  expr: junos_alarms_yellow_count > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Yellow alarm on {{ $labels.instance }}"
    description: "Device {{ $labels.instance }} is reporting {{ $value }} yellow alarms."

System Health

JunosHighCPU

Alert when CPU usage is above 80% for 5 minutes.

- alert: JunosHighCPU
  expr: junos_route_engine_cpu_usage_percent > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage on {{ $labels.instance }} is at {{ $value }}%."

JunosHighMemory

Alert when memory usage is above 90% for 5 minutes.

- alert: JunosHighMemory
  expr: junos_route_engine_memory_utilization_percent > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High Memory usage on {{ $labels.instance }}"
    description: "Memory usage on {{ $labels.instance }} is at {{ $value }}%."

Environment

JunosFanFailure

Alert when a fan is not in OK state (1).

- alert: JunosFanFailure
  expr: junos_environment_fan_status != 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Fan failure on {{ $labels.instance }}"
    description: "Fan {{ $labels.item }} on {{ $labels.instance }} is reporting status {{ $value }}."

JunosPowerSupplyFailure

Alert when a power supply (PEM) is not in OK state (1).

- alert: JunosPowerSupplyFailure
  expr: junos_environment_pem_status != 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Power supply failure on {{ $labels.instance }}"
    description: "Power supply {{ $labels.item }} on {{ $labels.instance }} is reporting status {{ $value }}."

JunosHighTemperature

Alert when temperature exceeds 50 degrees Celsius.

- alert: JunosHighTemperature
  expr: junos_environment_temperature_celsius > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High temperature on {{ $labels.instance }}"
    description: "Temperature sensor {{ $labels.item }} on {{ $labels.instance }} is reporting {{ $value }}°C."

Network

JunosInterfaceDown

Alert when an interface is administratively up but operationally down.

- alert: JunosInterfaceDown
  expr: junos_interface_admin_status == 1 and junos_interface_oper_status != 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Interface down on {{ $labels.instance }}"
    description: "Interface {{ $labels.target_name }} on {{ $labels.instance }} is administratively up but operationally down."

JunosBGPSessionDown

Alert when a BGP session is not in the Established state (state 6 of the BGP FSM).

- alert: JunosBGPSessionDown
  expr: junos_bgp_session_state != 6
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "BGP session down on {{ $labels.instance }}"
    description: "BGP session {{ $labels.peer_address }} on {{ $labels.instance }} is not established (state {{ $value }})."

Omada Exporter

Alerting rules for TP-Link Omada Controller using omada_exporter.

Device Health

OmadaDeviceHighCPU

Alert when device CPU usage is above 80% for 5 minutes.

- alert: OmadaDeviceHighCPU
  expr: omada_device_cpu_percentage > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.device }}"
    description: "CPU usage on {{ $labels.device }} ({{ $labels.model }}) is at {{ $value }}%."

OmadaDeviceHighMemory

Alert when device memory usage is above 90% for 5 minutes.

- alert: OmadaDeviceHighMemory
  expr: omada_device_mem_percentage > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High Memory usage on {{ $labels.device }}"
    description: "Memory usage on {{ $labels.device }} ({{ $labels.model }}) is at {{ $value }}%."

OmadaDeviceNeedUpgrade

Alert when a device needs a firmware upgrade.

- alert: OmadaDeviceNeedUpgrade
  expr: omada_device_need_upgrade == 1
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Device upgrade available for {{ $labels.device }}"
    description: "Device {{ $labels.device }} ({{ $labels.model }}) has a firmware upgrade available."

Controller Health

OmadaControllerLowStorage

Alert when controller storage usage is above 90%.

- alert: OmadaControllerLowStorage
  expr: (omada_controller_storage_used_bytes / omada_controller_storage_available_bytes) * 100 > 90
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Low storage on Omada Controller {{ $labels.controller_name }}"
    description: "Storage usage on controller {{ $labels.controller_name }} is at {{ $value }}%."

PoE

OmadaDeviceLowPoERemaining

Alert when remaining PoE power is less than 10 Watts.

- alert: OmadaDeviceLowPoERemaining
  expr: omada_device_poe_remain_watts < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low PoE power remaining on {{ $labels.device }}"
    description: "Device {{ $labels.device }} has only {{ $value }}W of PoE power remaining."

Oxidized Exporter

Alerting rules for Oxidized (network device configuration backup).

OxidizedBackupFailed

Alert when an Oxidized network backup has failed for a device.

- alert: OxidizedBackupFailed
  expr: oxidized_device_status{job="oxidized"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Oxidized network backup failed
    description: "Since 5mn, the backup has failed"

OxidizedBackupEmpty

Alert when an Oxidized network backup is empty (0 lines of config).

- alert: OxidizedBackupEmpty
  expr: oxidized_device_config_lines{job="oxidized"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Oxidized network backup is empty
    description: "Since 5mn, the backup is empty"

Mosquitto Exporter

Alerting rules for Mosquitto MQTT Broker using mosquitto-exporter.

Broker Health

MosquittoBrokerRestarted

Alert when the broker has restarted recently (uptime < 10 minutes).

- alert: MosquittoBrokerRestarted
  expr: broker_uptime < 600
  for: 0m
  labels:
    severity: info
  annotations:
    summary: "Mosquitto broker restarted on {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} has been up for less than 10 minutes (uptime: {{ $value }}s)."

MosquittoNoClientsConnected

Alert when there are no clients connected for 15 minutes.

- alert: MosquittoNoClientsConnected
  expr: broker_clients_connected == 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "No clients connected to Mosquitto on {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} has 0 connected clients."

MosquittoDroppedMessages

Alert when messages are being dropped.

- alert: MosquittoDroppedMessages
  expr: rate(broker_publish_messages_dropped[5m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Mosquitto dropping messages on {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} is dropping messages (rate: {{ $value }})."

Load

MosquittoHighMessageRate

Alert when the message reception rate is unusually high (adjust threshold as needed).

- alert: MosquittoHighMessageRate
  expr: rate(broker_messages_received[5m]) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High message rate on Mosquitto {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} is receiving {{ $value }} messages per second."

Contributing

We welcome contributions to the Prometheus Alerting Rules Collection!

Repository

You can find the source code and contribute on our GitHub repository: https://github.com/cloudducoeur/collection-prometheus-alerting-rules

How to contribute

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Add your changes and validate them with promtool (see the sketch after this list).
  4. Submit a pull request.
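
Rule changes can be checked before opening a pull request with promtool: "promtool check rules <file>" validates the syntax, and "promtool test rules <file>" runs unit tests. A minimal sketch of a unit test, assuming the Blackbox rules above are saved in rules.yml; the instance value is illustrative:

# tests.yml (run with: promtool test rules tests.yml)
rule_files:
  - rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'probe_http_status_code{instance="https://example.org"}'
        values: '500x5'
    alert_rule_test:
      - eval_time: 5m
        alertname: BlackboxProbeHttpFailure
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: https://example.org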