Prometheus Alerting Rules Collection
Welcome to the Prometheus Alerting Rules Collection.

Join us
Do you want to help us, join us, or learn more? Everything happens here: https://doc.aucoeurdu.cloud/contribuer/
Node Exporter
Alerting rules for nodes (Node Exporter).
NodeFilesystemAlmostOutOfSpace
Alert when there is less than 10% free disk space.
- alert: NodeFilesystemAlmostOutOfSpace
  expr: node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""} * 100 < 10
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Filesystem has less than 10% space left"
    description: "Filesystem {{ $labels.device }} mounted on {{ $labels.mountpoint }} has only {{ $value }}% space left."
Blackbox Exporter
Alerting rules for Blackbox Exporter.
BlackboxProbeHttpFailure
Alert when the HTTP status code is not 200-399.
- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
    description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
BlackboxSlowProbe
Alert when the average probe duration over 1 minute is greater than 1 second.
- alert: BlackboxSlowProbe
  expr: avg_over_time(probe_duration_seconds[1m]) > 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: Blackbox slow probe (instance {{ $labels.instance }})
    description: "Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
BlackboxSslCertificateWillExpireSoon (Warning)
Alert when the SSL certificate expires in less than 30 days.
- alert: BlackboxSslCertificateWillExpireSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
BlackboxSslCertificateWillExpireSoon (Critical)
Alert when the SSL certificate expires in less than 3 days.
- alert: BlackboxSslCertificateWillExpireSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
    description: "SSL certificate expires in 3 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
BlackboxSslCertificateExpired
Alert when the SSL certificate has expired.
- alert: BlackboxSslCertificateExpired
  expr: probe_ssl_earliest_cert_expiry - time() <= 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
    description: "SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Rudder Exporter
Alerting rules for Rudder (using prometheus-rudder-exporter).
RudderDown
Alert when the Rudder API is down.
- alert: RudderDown
  expr: rudder_up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Rudder API is down"
    description: "The Rudder API is not reachable."
RudderGlobalComplianceLow
Alert when the global compliance falls below 80%.
- alert: RudderGlobalComplianceLow
  expr: rudder_global_compliance < 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Rudder global compliance is low"
    description: "The global compliance of the infrastructure is at {{ $value }}% (below 80%)."
RudderNodeComplianceLow
Alert when a specific node compliance falls below 80%.
- alert: RudderNodeComplianceLow
  expr: rudder_node_compliance < 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Rudder node compliance is low"
    description: "The compliance of node {{ $labels.node_hostname }} is at {{ $value }}% (below 80%)."
HAProxy
Alerting rules for HAProxy (using the built-in Prometheus exporter).
HaproxyBackendMaxActiveSession
Alert when a backend server's active sessions exceed 80% of its configured session limit.
- alert: HaproxyBackendMaxActiveSession
  expr: ((haproxy_server_max_sessions > 0) * 100) / (haproxy_server_limit_sessions > 0) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "HAProxy backend max active session > 80% (instance {{ $labels.instance }})"
    description: "Active sessions from backend {{ $labels.proxy }} to server {{ $labels.server }} reached 80% of the limit - {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
HaproxyHasNoAliveBackends
Alert when an HAProxy backend has no alive servers (active or backup).
- alert: HaproxyHasNoAliveBackends
  expr: haproxy_backend_active_servers + haproxy_backend_backup_servers == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "HAProxy has no alive backends (instance {{ $labels.instance }})"
    description: "HAProxy has no alive active or backup backends for {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
OpenBao Exporter
Alerting rules for OpenBao (compatible with Vault metrics).
VaultSealed
Alert when the Vault/OpenBao instance is sealed.
- alert: VaultSealed
  expr: vault_core_unsealed == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Vault sealed (instance {{ $labels.instance }})"
    description: "Vault instance is sealed on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
VaultTooManyPendingTokens
Alert when there are too many pending tokens.
- alert: VaultTooManyPendingTokens
  expr: avg(vault_token_create_count - vault_token_store_count) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Vault too many pending tokens (instance {{ $labels.instance }})"
    description: "Too many pending tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
VaultTooManyInfinityTokens
Alert when there are too many tokens with infinite TTL.
- alert: VaultTooManyInfinityTokens
  expr: vault_token_count_by_ttl{creation_ttl="+Inf"} > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Vault too many infinity tokens (instance {{ $labels.instance }})"
    description: "Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
VaultClusterHealth
Alert when the cluster health is degraded (50% or fewer active nodes).
- alert: VaultClusterHealth
  expr: sum(vault_core_active) / count(vault_core_active) <= 0.5
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Vault cluster health (instance {{ $labels.instance }})"
    description: "Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Unbound Exporter
Alerting rules for Unbound DNS resolver (using unbound_exporter).
UnboundResolverDown
Alert when a specific Unbound resolver node is down.
- alert: UnboundResolverDown
  expr: unbound_up == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "DNS resolving stack: one or more nodes down"
    description: "Unbound on {{ $labels.instance }} is not healthy"
UnboundResolverStackDown
Alert when the entire DNS resolving stack is down (no healthy nodes).
- alert: UnboundResolverStackDown
  expr: absent(unbound_up)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "DNS resolving stack is down"
    description: "No more nodes in DNS resolving stack"
UnboundResolverStackResponse
Alert when the DNS resolution response time is high (requires Blackbox Exporter).
- alert: UnboundResolverStackResponse
  expr: quantile without(instance) (0.5, probe_duration_seconds{module="dns_udp"}) > 0.2
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "DNS resolving stack response time is high"
    description: "{{ $value | humanizeDuration }} median resolving response time across the DNS resolving stack"
UnboundResolverStackServFail
Alert when the SERVFAIL rate is high (> 50%).
- alert: UnboundResolverStackServFail
  expr: sum without(rcode) (unbound_answer_rcodes_total{rcode="SERVFAIL"}) / sum without(thread) (unbound_queries_total) > 0.5
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "DNS resolving stack is showing a high SERVFAIL rate"
    description: "{{ $labels.instance }} is showing a SERVFAIL rate of {{ $value | humanizePercentage }} in the last 2 minutes."
Junos Exporter
Alerting rules for Juniper devices using junos_exporter.
Hardware Alarms
JunosRedAlarm
Alert when there is a red alarm on the device.
- alert: JunosRedAlarm
  expr: junos_alarms_red_count > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Red alarm on {{ $labels.instance }}"
    description: "Device {{ $labels.instance }} is reporting {{ $value }} red alarms."
JunosYellowAlarm
Alert when there is a yellow alarm on the device.
- alert: JunosYellowAlarm
  expr: junos_alarms_yellow_count > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Yellow alarm on {{ $labels.instance }}"
    description: "Device {{ $labels.instance }} is reporting {{ $value }} yellow alarms."
System Health
JunosHighCPU
Alert when CPU usage is above 80% for 5 minutes.
- alert: JunosHighCPU
  expr: junos_route_engine_cpu_usage_percent > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage on {{ $labels.instance }} is at {{ $value }}%."
JunosHighMemory
Alert when memory usage is above 90% for 5 minutes.
- alert: JunosHighMemory
  expr: junos_route_engine_memory_utilization_percent > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High Memory usage on {{ $labels.instance }}"
    description: "Memory usage on {{ $labels.instance }} is at {{ $value }}%."
Environment
JunosFanFailure
Alert when a fan is not in the OK state (status 1).
- alert: JunosFanFailure
  expr: junos_environment_fan_status != 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Fan failure on {{ $labels.instance }}"
    description: "Fan {{ $labels.item }} on {{ $labels.instance }} is reporting status {{ $value }}."
JunosPowerSupplyFailure
Alert when a power supply (PEM) is not in the OK state (status 1).
- alert: JunosPowerSupplyFailure
  expr: junos_environment_pem_status != 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Power supply failure on {{ $labels.instance }}"
    description: "Power supply {{ $labels.item }} on {{ $labels.instance }} is reporting status {{ $value }}."
JunosHighTemperature
Alert when temperature exceeds 50 degrees Celsius.
- alert: JunosHighTemperature
  expr: junos_environment_temperature_celsius > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High temperature on {{ $labels.instance }}"
    description: "Temperature sensor {{ $labels.item }} on {{ $labels.instance }} is reporting {{ $value }}°C."
Network
JunosInterfaceDown
Alert when an interface is administratively up but operationally down.
- alert: JunosInterfaceDown
  expr: junos_interface_admin_status == 1 and junos_interface_oper_status != 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Interface down on {{ $labels.instance }}"
    description: "Interface {{ $labels.target_name }} on {{ $labels.instance }} is administratively up but operationally down."
JunosBGPSessionDown
Alert when a BGP session is not in the Established state (state 6).
- alert: JunosBGPSessionDown
  expr: junos_bgp_session_state != 6
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "BGP session down on {{ $labels.instance }}"
    description: "BGP session {{ $labels.peer_address }} on {{ $labels.instance }} is not established (state {{ $value }})."
Omada Exporter
Alerting rules for TP-Link Omada Controller using omada_exporter.
Device Health
OmadaDeviceHighCPU
Alert when device CPU usage is above 80% for 5 minutes.
- alert: OmadaDeviceHighCPU
  expr: omada_device_cpu_percentage > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.device }}"
    description: "CPU usage on {{ $labels.device }} ({{ $labels.model }}) is at {{ $value }}%."
OmadaDeviceHighMemory
Alert when device memory usage is above 90% for 5 minutes.
- alert: OmadaDeviceHighMemory
  expr: omada_device_mem_percentage > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High Memory usage on {{ $labels.device }}"
    description: "Memory usage on {{ $labels.device }} ({{ $labels.model }}) is at {{ $value }}%."
OmadaDeviceNeedUpgrade
Alert when a device needs a firmware upgrade.
- alert: OmadaDeviceNeedUpgrade
  expr: omada_device_need_upgrade == 1
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Device upgrade available for {{ $labels.device }}"
    description: "Device {{ $labels.device }} ({{ $labels.model }}) has a firmware upgrade available."
Controller Health
OmadaControllerLowStorage
Alert when controller storage usage is above 90%.
- alert: OmadaControllerLowStorage
  expr: (omada_controller_storage_used_bytes / omada_controller_storage_available_bytes) * 100 > 90
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Low storage on Omada Controller {{ $labels.controller_name }}"
    description: "Storage usage on controller {{ $labels.controller_name }} is at {{ $value }}%."
PoE
OmadaDeviceLowPoERemaining
Alert when remaining PoE power is less than 10 Watts.
- alert: OmadaDeviceLowPoERemaining
  expr: omada_device_poe_remain_watts < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low PoE power remaining on {{ $labels.device }}"
    description: "Device {{ $labels.device }} has only {{ $value }}W of PoE power remaining."
Oxidized Exporter
Alerting rules for Oxidized (network device configuration backup).
OxidizedBackupFailed
Alert when an Oxidized network backup has failed for a device.
- alert: OxidizedBackupFailed
  expr: oxidized_device_status{job="oxidized"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Oxidized network backup failed
    description: "The network configuration backup has been failing for the last 5 minutes."
OxidizedBackupEmpty
Alert when an Oxidized network backup is empty (0 lines of config).
- alert: OxidizedBackupEmpty
  expr: oxidized_device_config_lines{job="oxidized"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Oxidized network backup is empty
    description: "The network configuration backup has been empty for the last 5 minutes."
Mosquitto Exporter
Alerting rules for Mosquitto MQTT Broker using mosquitto-exporter.
Broker Health
MosquittoBrokerRestarted
Alert when the broker has restarted recently (uptime < 10 minutes).
- alert: MosquittoBrokerRestarted
  expr: broker_uptime < 600
  for: 0m
  labels:
    severity: info
  annotations:
    summary: "Mosquitto broker restarted on {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} has been up for less than 10 minutes (uptime: {{ $value }}s)."
MosquittoNoClientsConnected
Alert when there are no clients connected for 15 minutes.
- alert: MosquittoNoClientsConnected
  expr: broker_clients_connected == 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "No clients connected to Mosquitto on {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} has 0 connected clients."
MosquittoDroppedMessages
Alert when messages are being dropped.
- alert: MosquittoDroppedMessages
  expr: rate(broker_publish_messages_dropped[5m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Mosquitto dropping messages on {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} is dropping messages (rate: {{ $value }})."
Load
MosquittoHighMessageRate
Alert when the message reception rate is unusually high (adjust threshold as needed).
- alert: MosquittoHighMessageRate
  expr: rate(broker_messages_received[5m]) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High message rate on Mosquitto {{ $labels.instance }}"
    description: "Mosquitto broker on {{ $labels.instance }} is receiving {{ $value }} messages per second."
Contributing
We welcome contributions to the Prometheus Alerting Rules Collection!

Repository
You can find the source code and contribute on our GitHub repository: https://github.com/cloudducoeur/collection-prometheus-alerting-rules
How to contribute
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Add your changes.
- Submit a pull request.