Instrumenting Celery with a Prometheus exporter

This guide sits within Prometheus metrics for workers and the broader topic of observability for job queues, and walks through exposing Celery task metrics to Prometheus end to end.

Celery emits no Prometheus metrics out of the box. Without instrumentation you are blind to task throughput, failure rates, runtime distributions, and worker saturation — the four signals you need before you can alert on anything. The symptom is familiar: a queue silently backs up, tasks start timing out, and the first you hear of it is a customer ticket. This page fixes that by exposing a /metrics endpoint that Prometheus scrapes, using either the standalone celery-exporter (which listens to the Celery event bus) or an in-process prometheus_client collector driven by Celery signals.

Prerequisites

A running Celery 5.x application with a Redis or RabbitMQ broker. If you have not set one up, follow setting up Celery with a Redis broker.
Worker events enabled (task_send_sent_event / worker_send_task_events), required only for the standalone exporter approach.
A reachable Prometheus server (v2.x) you can edit the scrape config on.
Python 3.9+ and the ability to add a dependency to your worker image.
A network path from Prometheus to the exporter port (default 9808) or to your in-process metrics port.

Step 1 — Choose an approach

The standalone celery-exporter runs as a separate process, subscribes to Celery's event stream, and aggregates metrics across all workers — ideal when you do not want to modify application code. The in-process prometheus_client approach hooks Celery signals directly inside the worker and gives you fully custom metrics at the cost of a small code change and a metrics port per worker.

# Option A: standalone exporter (no app code change)
pip install celery-exporter

# Option B: in-process instrumentation (custom metrics)
pip install prometheus_client

Use Option A for a quick, broker-wide rollout; use Option B when you need bespoke labels or histogram buckets tuned to your workloads.

Step 2a — Run the standalone celery-exporter

The exporter connects to the same broker your workers use and reads the event stream. It needs events turned on in your Celery config.

# celeryconfig.py — required for the standalone exporter to see tasks
worker_send_task_events = True   # emit task-started/succeeded/failed events
task_send_sent_event = True      # emit a task-sent event from producers too

# Run the exporter pointed at your broker; it exposes /metrics on :9808
celery-exporter \
  --broker-url="redis://redis:6379/0" \
  --port=9808 \
  --buckets="0.1,0.5,1,2.5,5,10,30,60"   # histogram buckets in seconds

The exporter now publishes series such as celery_task_succeeded_total, celery_task_failed_total, celery_task_runtime_seconds_bucket, and celery_worker_up, all labelled by name (task name) and queue.

Step 2b — Or instrument in-process with signals

If you prefer custom metrics, connect Celery's task_prerun, task_postrun, and task_failure signals to prometheus_client collectors and start an HTTP server inside the worker boot.

# metrics.py — in-process Celery instrumentation
import time
from prometheus_client import Counter, Histogram, start_http_server
from celery.signals import (
    task_prerun, task_postrun, task_failure, worker_process_init,
)

TASK_RUNTIME = Histogram(
    "celery_task_runtime_seconds",
    "Task execution time in seconds",
    ["task_name", "queue"],
    buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60),
)
TASKS_TOTAL = Counter(
    "celery_tasks_total", "Tasks by terminal state",
    ["task_name", "queue", "state"],
)
_start_times = {}

@worker_process_init.connect
def _start_metrics_server(**_):
    # One metrics endpoint per worker process; pick a port per concurrency slot
    start_http_server(9809)

@task_prerun.connect
def _on_prerun(task_id, task, **_):
    _start_times[task_id] = time.perf_counter()

@task_postrun.connect
def _on_postrun(task_id, task, state, **_):
    started = _start_times.pop(task_id, None)
    queue = getattr(task.request, "delivery_info", {}).get("routing_key", "default")
    if started is not None:
        TASK_RUNTIME.labels(task.name, queue).observe(time.perf_counter() - started)
    TASKS_TOTAL.labels(task.name, queue, state or "UNKNOWN").inc()

@task_failure.connect
def _on_failure(task_id, sender, **_):
    queue = getattr(sender.request, "delivery_info", {}).get("routing_key", "default")
    TASKS_TOTAL.labels(sender.name, queue, "FAILURE").inc()

# celery_app.py — import metrics so the signal handlers register
from celery import Celery
import metrics  # noqa: F401  (side-effect import wires the signals)

app = Celery("tasks", broker="redis://redis:6379/0")

With --concurrency using the prefork pool, each child process needs its own port (e.g. derive it from the process index) or you will hit "address already in use". For a single metrics endpoint per host regardless of concurrency, prefer the standalone exporter in Step 2a.

Step 3 — Expose and verify the endpoint locally

Start a worker and curl the endpoint before touching Prometheus. This isolates instrumentation bugs from scrape-config bugs.

celery -A celery_app worker --loglevel=info --concurrency=1 &
sleep 3
curl -s http://localhost:9809/metrics | grep celery_   # in-process
# or for the standalone exporter:
curl -s http://localhost:9808/metrics | grep celery_

You should see celery_tasks_total, celery_task_runtime_seconds_bucket, and # HELP / # TYPE lines. Fire a task and re-curl to confirm counters increment.

Step 4 — Add the target to Prometheus

Point Prometheus at the exporter. Use a 15s scrape interval — fine-grained enough to catch fast-moving backlogs without overloading the exporter.

# prometheus.yml — scrape config
scrape_configs:
  - job_name: "celery"
    scrape_interval: 15s
    static_configs:
      - targets: ["celery-exporter:9808"]   # or your worker host:9809
        labels:
          service: "payments-workers"

For dynamic worker fleets on Kubernetes, replace static_configs with kubernetes_sd_configs and a pod annotation like prometheus.io/scrape: "true"; this pairs well with auto-scaling Celery workers on Kubernetes.

# Reload Prometheus without a restart (requires --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload

Step 5 — Verify metrics in Prometheus

Confirm the target is healthy and series are flowing. Open http://prometheus:9090/targets and check the celery job shows UP. Then run a query in the expression browser or via the HTTP API.

# Per-task success rate over 5 minutes — should be non-zero once tasks run
sum by (task_name) (rate(celery_tasks_total{state="SUCCESS"}[5m]))

# Scripted verification via the Prometheus HTTP API
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=up{job="celery"}' | python -m json.tool
# Expect: data.result[].value[1] == "1"

A 1 for up{job="celery"} plus a non-empty rate query confirms the full path: worker → exporter → scrape → queryable series.

Gotchas and edge cases

Cardinality blow-up from task IDs. Never label metrics with task_id, user IDs, or any unbounded value. Each unique label set is a new time series; a task_id label can mint millions of series and OOM Prometheus. Keep labels to task_name, queue, and state.

Counters reset on worker restart. celery_tasks_total resets to zero when a worker process recycles (e.g. after worker_max_tasks_per_child). This is expected — always wrap counters in rate() or increase(), which are restart-aware, rather than reading raw values.

Prefork pool port collisions. With the in-process approach and --concurrency > 1, every prefork child runs start_http_server. Either bind a distinct port per child or use a multiprocess-mode prometheus_client directory (PROMETHEUS_MULTIPROC_DIR) so all children share one endpoint. The standalone exporter sidesteps this entirely.

Events disabled means an empty exporter. The standalone celery-exporter shows zero task metrics if worker_send_task_events is off. If /metrics only returns celery_worker_up and nothing else, check that events are enabled and that the exporter points at the same broker and vhost as your workers.

Alerting on queue backlog with Prometheus — turn these metrics into PromQL alert rules and Alertmanager routes.
Prometheus metrics for workers — the metrics, labels, and recording rules that underpin worker observability.
Observability and monitoring for job queues — how metrics, dashboards, and tracing fit together across a queue platform.
Celery architecture and configuration — broker, pool, and routing settings that shape what your metrics show.