Alerting on queue backlog with Prometheus

This page is part of Prometheus metrics for workers within the wider topic of observability for job queues, and shows how to alert on a backlog before it becomes an incident.

A backlog is the earliest reliable signal that workers cannot keep up. Queue depth climbs, end-to-end latency stretches from seconds to hours, and downstream SLAs break — yet a naive queue_depth > 1000 threshold either misses gradual saturation or pages you at 3 a.m. for a harmless batch spike. The goal here is a small set of PromQL rules that distinguish transient depth (self-correcting) from sustained growth (a real consumer shortfall), routed through Alertmanager without flapping.

Prerequisites

  • Worker metrics already in Prometheus. If not, start with instrumenting Celery with a Prometheus exporter.
  • A queue-depth gauge such as celery_queue_length (from celery-exporter) or a broker exporter (redis_exporter, rabbitmq_exporter) exposing per-queue message counts.
  • A task throughput counter (celery_tasks_total) and a retry counter or state="RETRY" series.
  • Prometheus rule-file loading enabled and a running Alertmanager.
  • A notification receiver (Slack, PagerDuty, email) reachable from Alertmanager.

Step 1 — Confirm the metrics you will alert on

Alerts are only as good as the underlying series. Verify each one returns data before writing rules against it.

# Current depth per queue — the primary backlog signal
celery_queue_length

# Net growth rate: arrivals minus departures, messages/sec, per queue
sum by (queue) (rate(celery_tasks_total{state="SENT"}[5m]))
  - sum by (queue) (rate(celery_tasks_total{state="SUCCESS"}[5m]))

# Retry pressure: fraction of outcomes that are retries
sum by (task_name) (rate(celery_tasks_total{state="RETRY"}[5m]))
  / sum by (task_name) (rate(celery_tasks_total[5m]))

If celery_queue_length is empty, your exporter is not reading queue lengths — switch to a broker-side exporter that does, or compute depth from broker metrics directly.

Step 2 — Add recording rules for stable signals

Raw ratios are noisy and recomputed on every alert evaluation. Pre-aggregate them as recording rules so alert expressions stay cheap and consistent.

# rules/queue_recording.yml
groups:
  - name: queue_recording
    interval: 30s
    rules:
      - record: queue:net_growth:rate5m
        expr: |
          sum by (queue) (rate(celery_tasks_total{state="SENT"}[5m]))
            - sum by (queue) (rate(celery_tasks_total{state="SUCCESS"}[5m]))
      - record: queue:retry_ratio:rate5m
        expr: |
          sum by (task_name) (rate(celery_tasks_total{state="RETRY"}[5m]))
            / clamp_min(sum by (task_name) (rate(celery_tasks_total[5m])), 0.001)

The clamp_min guard prevents a divide-by-zero NaN when a task is idle, which would otherwise make the retry-ratio alert flap as series appear and vanish.

Step 3 — Write the alerting rules

Three rules cover the common failure modes: sustained backlog growth, an absolute depth ceiling, and elevated retries. Each uses for: to require the condition to hold before firing, which is the single most effective anti-flapping control.

# rules/queue_alerts.yml
groups:
  - name: queue_alerts
    rules:
      - alert: QueueBacklogGrowing
        # Arrivals outpacing completions for a sustained window, not a blip
        expr: queue:net_growth:rate5m > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} is growing"
          description: "Net growth {{ $value }} msg/s sustained for 10m."

      - alert: QueueBacklogCritical
        # Absolute depth ceiling: a hard SLA-derived limit
        expr: celery_queue_length > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue {{ $labels.queue }} depth critical"
          description: "{{ $value }} messages pending on {{ $labels.queue }}."

      - alert: HighRetryRate
        expr: queue:retry_ratio:rate5m > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.task_name }} retry rate elevated"
          description: ">25% of {{ $labels.task_name }} outcomes are retries."

Prefer a rate-of-growth alert (QueueBacklogGrowing) over a static threshold wherever possible: it catches creeping saturation hours before a fixed ceiling would, and it does not page during a planned batch that drains on its own.

Step 4 — Predict time-to-overflow (optional but powerful)

For SLA-bound queues, alert on when the backlog is projected to breach a limit rather than on its current value. predict_linear fits a linear regression to the samples in the range and extrapolates it a given number of seconds ahead.

# Fire if the queue is on track to exceed 50k within 30 minutes (1800s)
predict_linear(celery_queue_length[20m], 1800) > 50000

This gives operators lead time to scale workers — see auto-scaling Celery workers on Kubernetes — before the SLA is actually at risk.
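
The expression above can be wrapped in an alerting rule in the same style as Step 3. This is a minimal sketch: the group name, the alert name QueueBacklogOverflowPredicted, the for: 5m hold, and the severity are illustrative assumptions, and the 50k limit should come from your own SLA.

```yaml
# rules/queue_prediction.yml (hypothetical file name)
groups:
  - name: queue_prediction_alerts
    rules:
      - alert: QueueBacklogOverflowPredicted
        # Linear trend over the last 20m, projected 1800s (30 minutes) ahead.
        # Assumption: 50000 is a placeholder limit, tune it per queue.
        expr: predict_linear(celery_queue_length[20m], 1800) > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} projected to exceed 50k within 30m"
          description: "Projected depth in 30 minutes: {{ $value }} messages."
```

A file like this would also need an entry under rule_files in prometheus.yml. The for: clause matters here as well, because a short burst can tilt the regression line steeply for a few evaluations.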

Step 5 — Wire Alertmanager and suppress flapping

Load the rules, then route them. Alertmanager's group_wait, group_interval, and repeat_interval control noise; inhibit_rules stop a warning from paging once its critical sibling already has.

# prometheus.yml — load the rule files and point at Alertmanager
rule_files:
  - "rules/queue_recording.yml"
  - "rules/queue_alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

# alertmanager.yml
route:
  group_by: ["queue"]
  group_wait: 30s        # batch alerts that fire together
  group_interval: 5m     # min time between notifications for a group
  repeat_interval: 4h    # re-notify only every 4h while still firing
  receiver: slack-default
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
inhibit_rules:
  # A critical backlog mutes the warning-level growth alert for the same queue
  - source_matchers: ['alertname="QueueBacklogCritical"']
    target_matchers: ['alertname="QueueBacklogGrowing"']
    equal: ["queue"]
receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#queue-alerts"
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "your-pagerduty-integration-key"

Verification

Confirm rules are loaded, then force an alert through to the receiver. Do not wait for a real backlog to discover the wiring is broken.

# 1. Rules parse and are loaded (each loaded group and rule is listed under a "name" key)
curl -s http://prometheus:9090/api/v1/rules | python -m json.tool | grep '"name"'

# 2. Validate rule files offline before deploying
promtool check rules rules/queue_alerts.yml

# 3. Confirm Alertmanager received and routed a firing alert
curl -s http://alertmanager:9093/api/v2/alerts | python -m json.tool

A successful promtool check rules, a firing state visible in /api/v1/rules, and a test message in your Slack/PagerDuty receiver confirm the full pipeline.
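
Rule logic can also be exercised offline with synthetic series, so a typo in a threshold or a for: clause shows up before deployment. The sketch below uses promtool's rule unit-test format and assumes a hypothetical test file placed next to queue_alerts.yml, with a made-up queue named ingest.

```yaml
# rules/queue_alerts_test.yml (hypothetical file name)
rule_files:
  - queue_alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic series: depth held at 20000 for ten minutes
      - series: 'celery_queue_length{queue="ingest"}'
        values: '20000+0x10'
    alert_rule_test:
      # Inside the for: 5m window the alert is only pending, not firing
      - eval_time: 3m
        alertname: QueueBacklogCritical
        exp_alerts: []
      # After 5m of persistence the alert should be firing
      - eval_time: 6m
        alertname: QueueBacklogCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              queue: ingest
            exp_annotations:
              summary: "Queue ingest depth critical"
              description: "20000 messages pending on ingest."
```

Run it with promtool test rules rules/queue_alerts_test.yml. This checks the rule expressions and templates only; it does not replace confirming that a notification actually reaches the receiver.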

Gotchas and edge cases

Static thresholds that never fit. A single > 10000 ceiling is wrong for both a high-volume ingest queue and a low-volume billing queue. Set per-queue thresholds via label-matched rules, or normalize against capacity (depth / active-worker count) so one rule scales across queues.
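
Label-matched rules for per-queue thresholds might look like the following sketch. The queue names ingest and billing and both limits are hypothetical placeholders, and the rules are meant to stand in place of a single global depth rule rather than alongside it.

```yaml
groups:
  - name: queue_depth_per_queue
    rules:
      # Assumption: "ingest" is a high-volume queue where 50k pending is still recoverable
      - alert: QueueBacklogCritical
        expr: celery_queue_length{queue="ingest"} > 50000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue {{ $labels.queue }} depth critical"
      # Assumption: "billing" is a low-volume queue where 200 pending already threatens the SLA
      - alert: QueueBacklogCritical
        expr: celery_queue_length{queue="billing"} > 200
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue {{ $labels.queue }} depth critical"
```

Reusing the same alert name keeps any routing or inhibition that matches on alertname working unchanged.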

Flapping from short windows and no for. An expression without a for: clause fires on a single noisy evaluation. Always combine a rate window of [5m] or longer with a for: of 5m to 15m. The window smooths the metric and for: requires persistence; you need both.

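Stale series after a queue is deleted. What happens when a queue is removed depends on the exporter. If its celery_queue_length series simply disappears, Prometheus staleness handling drops the series and the alert resolves on its own. If the exporter keeps reporting the last known depth instead of zero, the alert stays stuck firing. Add unless (absent_over_time(...)) guards where needed, and confirm that alerts for deleted queues actually resolve.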

Alerting on counters without rate(). celery_tasks_total is a monotonic counter that resets on worker restart. Comparing its raw value to a threshold produces nonsense; always wrap counters in rate() or increase(), both of which handle resets correctly.
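
As an illustration of the counter-safe form, the sketch below alerts on retry volume using increase() rather than the raw counter value. The group name, alert name, 15m window, and threshold of 100 are assumptions chosen for the example.

```yaml
groups:
  - name: task_counter_alerts
    rules:
      - alert: RetryVolumeHigh
        # increase() compensates for counter resets, so a worker restart
        # does not produce a false drop or spike in the result.
        # Assumption: 100 retries per 15m is a placeholder threshold.
        expr: sum by (task_name) (increase(celery_tasks_total{state="RETRY"}[15m])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.task_name }} retry volume elevated"
          description: "About {{ $value }} retries in the last 15m."
```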

FAQ

Should I alert on queue depth or on age of the oldest message? Both, for different reasons. Depth catches throughput shortfalls quickly, while oldest-message age (consumer lag in time) directly maps to your SLA — "no job waits more than 5 minutes." If your broker exposes a per-message-age or lag metric, alert on it as the primary SLA signal and keep depth as a leading indicator.
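
An age-based rule could be sketched as follows. The metric queue_oldest_message_age_seconds is a hypothetical name used only for illustration; substitute whatever age or lag gauge your broker exporter actually exposes, if it exposes one at all.

```yaml
groups:
  - name: queue_age_alerts
    rules:
      - alert: QueueOldestMessageTooOld
        # Assumption: queue_oldest_message_age_seconds is a hypothetical gauge
        # reporting the age of the oldest pending message, per queue.
        # 300s mirrors the "no job waits more than 5 minutes" SLA.
        expr: queue_oldest_message_age_seconds > 300
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Oldest message on {{ $labels.queue }} exceeds 5 minutes"
          description: "Oldest pending message is {{ $value }}s old."
```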

How do I avoid paging during expected nightly batches? Use a time-based mute in Alertmanager (time_intervals with a mute_time_intervals route) for the batch window, or rely on the predict_linear and net-growth rules, which tolerate a spike that drains on its own. Avoid raising static thresholds globally just to survive the batch.
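
A time-based mute might be configured as in the fragment below. The interval name nightly-batch, the 01:00 to 03:00 window, and the queue name batch-import are assumptions for the example; times are interpreted as UTC unless configured otherwise.

```yaml
# alertmanager.yml (fragment)
time_intervals:
  - name: nightly-batch            # hypothetical interval name
    time_intervals:
      - times:
          - start_time: "01:00"    # assumed batch window
            end_time: "03:00"

route:
  receiver: slack-default
  routes:
    # Routes match in order, so this must precede the general critical route.
    - matchers: ['severity="critical"', 'queue="batch-import"']
      receiver: pagerduty
      mute_time_intervals: ["nightly-batch"]
    - matchers: ['severity="critical"']
      receiver: pagerduty
```

Scoping the mute to one queue keeps paging intact for every other queue during the batch window.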
