Alerting on queue backlog with Prometheus
This page is part of Prometheus metrics for workers within the wider topic of observability for job queues, and shows how to alert on a backlog before it becomes an incident.
A backlog is the earliest reliable signal that workers cannot keep up. Queue depth climbs, end-to-end latency stretches from seconds to hours, and downstream SLAs break — yet a naive queue_depth > 1000 threshold either misses gradual saturation or pages you at 3 a.m. for a harmless batch spike. The goal here is a small set of PromQL rules that distinguish transient depth (self-correcting) from sustained growth (a real consumer shortfall), routed through Alertmanager without flapping.
Prerequisites
- Worker metrics already in Prometheus. If not, start with instrumenting Celery with a Prometheus exporter.
- A queue-depth gauge such as celery_queue_length (from celery-exporter) or a broker exporter (redis_exporter, rabbitmq_exporter) exposing per-queue message counts.
- A task throughput counter (celery_tasks_total) and a retry counter or state="RETRY" series.
- Prometheus rule-file loading enabled and a running Alertmanager.
- A notification receiver (Slack, PagerDuty, email) reachable from Alertmanager.
Step 1 — Confirm the metrics you will alert on
Alerts are only as good as the underlying series. Verify each one returns data before writing rules against it.
# Current depth per queue — the primary backlog signal
celery_queue_length
# Net growth rate: arrivals minus departures, messages/sec, per queue
sum by (queue) (rate(celery_tasks_total{state="SENT"}[5m]))
  - sum by (queue) (rate(celery_tasks_total{state="SUCCESS"}[5m]))
# Retry pressure: fraction of outcomes that are retries
sum by (task_name) (rate(celery_tasks_total{state="RETRY"}[5m]))
  / sum by (task_name) (rate(celery_tasks_total[5m]))
If celery_queue_length is empty, your exporter is not reading queue lengths — switch to a broker-side exporter that does, or compute depth from broker metrics directly.
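The check can also be scripted. The following is a minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 (the host used in the Verification section) and that the metric and label names match the queries in this step. The helper names are hypothetical. It uses the standard instant-query endpoint, /api/v1/query, and reports which selectors return no series.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; adjust to your own Prometheus address.
PROMETHEUS_URL = "http://prometheus:9090"

# Selectors behind the Step 1 expressions; each must return at least one series.
REQUIRED_QUERIES = [
    'celery_queue_length',
    'celery_tasks_total{state="SENT"}',
    'celery_tasks_total{state="SUCCESS"}',
    'celery_tasks_total{state="RETRY"}',
]


def http_get_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def series_count(query, fetch=http_get_json, base_url=PROMETHEUS_URL):
    """Return how many series an instant query yields."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    body = fetch(url)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {query!r}")
    return len(body["data"]["result"])


def missing_series(queries=REQUIRED_QUERIES, fetch=http_get_json):
    """List the queries that returned no series at all."""
    return [q for q in queries if series_count(q, fetch=fetch) == 0]
```

Calling missing_series() against a live server returns an empty list when every selector has data. The fetch parameter exists only so the HTTP call can be replaced in tests.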
Step 2 — Add recording rules for stable signals
Raw ratios are noisy and recomputed on every alert evaluation. Pre-aggregate them as recording rules so alert expressions stay cheap and consistent.
# rules/queue_recording.yml
groups:
  - name: queue_recording
    interval: 30s
    rules:
      - record: queue:net_growth:rate5m
        expr: |
          sum by (queue) (rate(celery_tasks_total{state="SENT"}[5m]))
          - sum by (queue) (rate(celery_tasks_total{state="SUCCESS"}[5m]))
      - record: queue:retry_ratio:rate5m
        expr: |
          sum by (task_name) (rate(celery_tasks_total{state="RETRY"}[5m]))
          / clamp_min(sum by (task_name) (rate(celery_tasks_total[5m])), 0.001)
The clamp_min guard prevents a divide-by-zero NaN when a task is idle, which would otherwise make the retry-ratio alert flap as series appear and vanish.
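To see what the guard changes numerically, here is a small sketch that mimics the division in plain Python. The function name and the sample rates are illustrative assumptions, not part of any exporter. It only reproduces the floating-point arithmetic, not PromQL's label matching.

```python
import math


def retry_ratio(retry_rate, total_rate, floor=None):
    """Divide retry rate by total rate, optionally clamping the denominator.

    floor=None mimics the unguarded expression; floor=0.001 mimics
    clamp_min(..., 0.001) from the recording rule.
    """
    denom = total_rate if floor is None else max(total_rate, floor)
    if denom == 0:
        # Floating-point division as PromQL applies it:
        # 0/0 is NaN, and a positive value over 0 is +Inf.
        return math.nan if retry_rate == 0 else math.inf
    return retry_rate / denom


idle_unguarded = retry_ratio(0.0, 0.0)            # NaN
idle_guarded = retry_ratio(0.0, 0.0, floor=0.001)  # a plain zero
busy_guarded = retry_ratio(0.5, 2.0, floor=0.001)  # unaffected by the floor
```

One trade-off worth knowing: for a task whose total rate is below the floor, the clamped ratio understates the true retry fraction. With a floor of 0.001 that only affects tasks running less than about once every 17 minutes.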
Step 3 — Write the alerting rules
Three rules cover the common failure modes: sustained backlog growth, an absolute depth ceiling, and elevated retries. Each uses for: to require the condition to hold before firing, which is the single most effective anti-flapping control.
# rules/queue_alerts.yml
groups:
  - name: queue_alerts
    rules:
      - alert: QueueBacklogGrowing
        # Arrivals outpacing completions for a sustained window, not a blip
        expr: queue:net_growth:rate5m > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} is growing"
          description: "Net growth {{ $value }} msg/s sustained for 10m."
      - alert: QueueBacklogCritical
        # Absolute depth ceiling: a hard SLA-derived limit
        expr: celery_queue_length > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue {{ $labels.queue }} depth critical"
          description: "{{ $value }} messages pending on {{ $labels.queue }}."
      - alert: HighRetryRate
        expr: queue:retry_ratio:rate5m > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.task_name }} retry rate elevated"
          description: ">25% of {{ $labels.task_name }} outcomes are retries."
Prefer a rate-of-growth alert (QueueBacklogGrowing) over a static threshold wherever possible: it catches creeping saturation well before a fixed ceiling would (how much earlier depends on how high the ceiling is set), and it does not page during a planned batch that drains on its own.
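The difference in detection time can be worked out with simple arithmetic. The sketch below assumes a constant net growth rate and ignores the warm-up of the 5m rate window. The function names are hypothetical, and the defaults are the thresholds and for: durations from the rules above.

```python
def static_alert_delay(start_depth, net_rate, ceiling=10000, hold=300):
    """Seconds until the depth-ceiling alert fires, or None if it never does."""
    if start_depth > ceiling:
        return hold  # already over the ceiling, only the for: window remains
    if net_rate <= 0:
        return None  # depth is flat or draining, the ceiling is never crossed
    return (ceiling - start_depth) / net_rate + hold


def growth_alert_delay(net_rate, threshold=5, hold=600):
    """Seconds until the growth alert fires, or None if the rate is too low."""
    return hold if net_rate > threshold else None


# An empty queue that starts growing at 6 msg/s
static_s = static_alert_delay(0, 6)
growth_s = growth_alert_delay(6)
```

With these numbers the growth rule fires after 10 minutes and the ceiling rule after about 33 minutes. A higher ceiling widens the gap. The same arithmetic also shows why the ceiling rule is still worth keeping: a queue growing at 3 msg/s never trips the growth rule at a threshold of 5, yet it crosses the ceiling in under an hour.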
Step 4 — Predict time-to-overflow (optional but powerful)
For SLA-bound queues, alert on when the backlog will breach a limit rather than the current value. predict_linear extrapolates the trend.
# Fire if the queue is on track to exceed 50k within 30 minutes (1800s)
predict_linear(celery_queue_length[20m], 1800) > 50000
This gives operators lead time to scale workers — see auto-scaling Celery workers on Kubernetes — before the SLA is actually at risk.
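To build intuition for the extrapolation, here is a sketch of a least-squares line fit evaluated at a future time, which is the idea behind predict_linear. It is an illustration under assumptions, not Prometheus's implementation. The function name and the synthetic samples are made up for the example.

```python
def predict_linear_sketch(samples, eval_time, horizon):
    """Fit a line to (unix_seconds, value) samples and extrapolate.

    Returns the fitted value at eval_time + horizon seconds.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples share a single timestamp")
    slope = cov / var  # fitted growth in messages per second
    value_at_eval = mean_v + slope * (eval_time - mean_t)
    return value_at_eval + slope * horizon


# Synthetic 20-minute window, one sample per minute:
# depth starts at 20000 and grows by 15 messages per second.
window = [(t, 20000 + 15 * t) for t in range(0, 1201, 60)]
predicted = predict_linear_sketch(window, eval_time=1200, horizon=1800)
would_fire = predicted > 50000
```

In this synthetic window the current depth is 38000, still under the 50k limit, but the fitted trend reaches about 65000 half an hour later, so the rule would fire now. A longer range selector smooths out short spikes at the cost of reacting more slowly to a change in slope.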
Step 5 — Wire Alertmanager and suppress flapping
Load the rules, then route them. Alertmanager's group_wait, group_interval, and repeat_interval control noise; inhibit_rules stop a warning from paging once its critical sibling already has.
# prometheus.yml: load the rule files and point at Alertmanager
rule_files:
  - "rules/queue_recording.yml"
  - "rules/queue_alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
# alertmanager.yml
route:
  group_by: ["queue"]
  group_wait: 30s        # batch alerts that fire together
  group_interval: 5m     # min time between notifications for a group
  repeat_interval: 4h    # re-notify only every 4h while still firing
  receiver: slack-default
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
inhibit_rules:
  # A critical backlog mutes the warning-level growth alert for the same queue
  - source_matchers: ['alertname="QueueBacklogCritical"']
    target_matchers: ['alertname="QueueBacklogGrowing"']
    equal: ["queue"]
receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#queue-alerts"
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "your-pagerduty-integration-key"
Verification
Confirm rules are loaded, then force an alert through to the receiver. Do not wait for a real backlog to discover the wiring is broken.
# 1. Rules parse and are loaded
curl -s http://prometheus:9090/api/v1/rules | python -m json.tool | grep '"name"'
# 2. Validate rule files offline before deploying
promtool check rules rules/queue_alerts.yml
# 3. Confirm Alertmanager received and routed a firing alert
curl -s http://alertmanager:9093/api/v2/alerts | python -m json.tool
A successful promtool check rules, a firing state visible in /api/v1/rules, and a test message in your Slack/PagerDuty receiver confirm the full pipeline.
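One way to force a test alert through without waiting for a real backlog is to post a synthetic alert straight to Alertmanager's v2 API. The sketch below assumes the alertmanager:9093 address used above. The alert name QueueAlertWiringTest, the queue label value, and the helper names are hypothetical. With the routing shown in Step 5, the default severity of warning should land in Slack rather than page anyone.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumed address; same host as the curl check in this section.
ALERTMANAGER_URL = "http://alertmanager:9093"


def build_test_alert(queue="wiring-test", severity="warning", minutes=5, now=None):
    """Build a one-element alert list for POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "QueueAlertWiringTest",  # hypothetical name, not a real rule
            "queue": queue,
            "severity": severity,
        },
        "annotations": {"summary": f"Synthetic test alert for queue {queue}"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt passes.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts, base_url=ALERTMANAGER_URL):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Calling send_alerts(build_test_alert()) should produce a Slack message after the group_wait delay. Passing severity="critical" exercises the PagerDuty route instead, so warn whoever is on call first.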
Gotchas and edge cases
Static thresholds that never fit. A single > 10000 ceiling is wrong for both a high-volume ingest queue and a low-volume billing queue. Set per-queue thresholds via label-matched rules, or normalize against capacity (depth / active-worker count) so one rule scales across queues.
Flapping from short windows and no for. An expression without a for: clause fires on a single noisy evaluation. Always combine a [5m]+ rate window with for: 5m–15m. The window smooths the metric; for requires persistence — you need both.
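The effect of for: can be illustrated with a small simulation. This is a simplified sketch of the pending-then-firing behavior, assuming evenly spaced evaluations. The function name is hypothetical, and it ignores details such as restarts and keep_firing_for.

```python
def firing_evaluations(condition_results, eval_interval=30, hold=300):
    """Return the indexes of evaluations at which the alert would be firing.

    condition_results: one boolean per rule evaluation (did the expr match?).
    eval_interval:     seconds between evaluations.
    hold:              the for: duration in seconds.
    """
    firing = []
    active_since = None  # time the condition first became true, if still true
    for i, matched in enumerate(condition_results):
        now = i * eval_interval
        if not matched:
            active_since = None  # any miss resets the pending timer
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= hold:
            firing.append(i)
    return firing


steady = firing_evaluations([True] * 11)                      # sustained breach
noisy = firing_evaluations([True] * 5 + [False] + [True] * 5)  # one dip resets
no_hold = firing_evaluations([True, False, True], hold=0)      # no for: clause
```

With a 30s evaluation interval and for: 5m, the sustained breach fires only on its eleventh evaluation, the noisy sequence never fires, and the run without a hold fires on every matching evaluation, which is the flapping this gotcha describes.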
Stale series after a queue is deleted. When a queue is removed, its celery_queue_length series can go stale rather than to zero, leaving an alert stuck firing. Add unless (absent_over_time(...)) guards, or rely on Prometheus staleness handling, and confirm deleted queues resolve.
Alerting on counters without rate(). celery_tasks_total is a monotonic counter that resets on worker restart. Comparing its raw value to a threshold produces nonsense; always wrap counters in rate() or increase(), both of which handle resets correctly.
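The reset handling can be sketched in a few lines. This is an illustration of the idea only: a drop in value is treated as a restart from zero. The function name and sample values are hypothetical, and the sketch leaves out the extrapolation to window boundaries that the real functions perform.

```python
def increase_with_resets(samples):
    """Total growth of a counter across samples, tolerating resets."""
    if len(samples) < 2:
        return 0.0
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # The counter went backwards, so the process restarted from zero
            # and everything counted so far in `cur` is new growth.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


# A worker restarts between the third and fourth scrape.
scrapes = [100, 160, 220, 15, 75]
naive = scrapes[-1] - scrapes[0]          # last minus first, ignores the reset
corrected = increase_with_resets(scrapes)
```

The naive difference is negative here, while the reset-aware total is 195 tasks, which is why a raw counter comparison cannot be trusted across worker restarts.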
FAQ
Should I alert on queue depth or on age of the oldest message?
Both, for different reasons. Depth catches throughput shortfalls quickly, while oldest-message age (consumer lag in time) maps directly to your SLA, for example "no job waits more than 5 minutes." If your broker exposes a per-message-age or lag metric, alert on it as the primary SLA signal and keep depth as a leading indicator.
How do I avoid paging during expected nightly batches?
Use a time-based mute in Alertmanager (time_intervals with a mute_time_intervals route) for the batch window, or rely on the predict_linear and net-growth rules, which tolerate a spike that drains on its own. Avoid raising static thresholds globally just to survive the batch.
Related
- Instrumenting Celery with a Prometheus exporter — produce the queue-depth and task metrics these alerts consume.
- Prometheus metrics for workers — the metric and label model behind effective worker alerting.
- Observability and monitoring for job queues — where alerting fits alongside dashboards and tracing.
- Auto-scaling Celery workers on Kubernetes — act on backlog alerts by adding consumer capacity automatically.