Celery Task Retry & Error Handling

This guide expands on the retry primitives introduced in Celery Architecture & Configuration and the broader patterns in Backend Frameworks & Worker Scaling, turning ad-hoc self.retry() calls into a deterministic, observable failure-handling strategy.

A naive Celery task that calls an external API will eventually hit a ConnectionError, a 429 Too Many Requests, or a worker that gets SIGKILLed mid-execution by Kubernetes. Without explicit retry policy, transient failures become permanent task loss, and a single poison payload can be redelivered forever, saturating a queue. The symptom is usually a flood of identical tracebacks in the logs followed by missing side effects — emails never sent, payments never captured — with no record of where the work went. This page builds a layered policy: declarative autoretry for transient errors, capped exponential backoff with jitter to avoid retry storms, durable acknowledgment so crashes don't drop work, and an explicit dead-letter path for tasks that exhaust their budget.

Prerequisites

Step 1 — Declarative retries with autoretry_for

Manually wrapping every task body in try/except self.retry() is repetitive and error-prone. The autoretry_for argument on the task decorator tells Celery which exception classes should trigger an automatic retry, so the task body stays focused on business logic.

# tasks.py
from celery import shared_task
from requests.exceptions import ConnectionError, Timeout

@shared_task(
    bind=True,
    autoretry_for=(ConnectionError, Timeout),  # only retry transient network errors
    max_retries=5,
)
def fetch_profile(self, user_id):
    resp = http_client.get(f"/users/{user_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()

Be deliberate about which exceptions you list. A ValueError from a malformed payload will never succeed on retry — it is a poison message, and retrying it just wastes worker capacity. Reserve autoretry_for for errors that are genuinely transient (network, rate limits, lock contention).

Step 2 — Exponential backoff with jitter

Retrying immediately hammers a struggling downstream service and can turn a brief blip into a self-inflicted outage. retry_backoff enables exponential delay (2s, 4s, 8s, …), retry_backoff_max caps the ceiling, and retry_jitter randomizes the delay to prevent thundering-herd synchronization when many tasks fail at once.

# tasks.py
@shared_task(
    bind=True,
    autoretry_for=(ConnectionError, Timeout),
    retry_backoff=2,          # first retry waits ~2s, then 4s, 8s, 16s ...
    retry_backoff_max=600,    # never wait longer than 10 minutes between retries
    retry_jitter=True,        # randomize delay in [0, computed_delay] to spread load
    max_retries=8,
)
def sync_to_crm(self, record_id):
    crm.upsert(record_id)

With retry_jitter=True, the actual countdown is a random value between zero and the computed exponential delay, so 1,000 tasks that all failed against the same API at the same instant won't all retry in lockstep.

Step 3 — Fine-grained control with retry_kwargs and explicit self.retry()

When you need different behavior per call site, or want to attach a custom countdown computed at runtime, raise the retry explicitly. retry_kwargs sets defaults at the decorator level; an explicit self.retry() overrides them for a specific failure.

# tasks.py
@shared_task(
    bind=True,
    retry_kwargs={"max_retries": 3},  # default budget for this task
)
def charge_card(self, payment_id):
    try:
        gateway.charge(payment_id)
    except gateway.RateLimited as exc:
        # honor the gateway's Retry-After header instead of a fixed schedule
        raise self.retry(exc=exc, countdown=int(exc.retry_after))
    except gateway.CardDeclined as exc:
        # permanent business failure — do not retry, surface to caller
        raise exc

The distinction matters: RateLimited is transient and gets a server-directed countdown, while CardDeclined is terminal and propagates immediately rather than burning the retry budget.

Step 4 — Survive worker crashes with acks_late and reject_on_worker_lost

By default Celery acknowledges a message to the broker before the task runs. If the worker is killed mid-task (OOM, pod eviction, SIGKILL), that message is already gone and the work is lost. Setting task_acks_late=True defers the acknowledgment until the task finishes, so a crash leaves the message on the broker for redelivery. Pair it with task_reject_on_worker_lost=True so the message is requeued rather than silently dropped when the worker connection is severed.

# celeryconfig.py
task_acks_late = True               # ack only after the task completes
task_reject_on_worker_lost = True   # requeue (not discard) if the worker dies mid-task
worker_prefetch_multiplier = 1      # don't hoard messages a crashed worker can't finish

This trades at-most-once for at-least-once delivery — the task may now run twice if a worker dies after doing its work but before acking. That is exactly why Step 0's idempotency requirement is non-negotiable.

Step 5 — Run cleanup logic with on_failure and after_return hooks

When retries are finally exhausted, you want a single place to emit metrics, alert, and capture context. Subclass celery.Task and override on_failure, which fires after the last retry fails.

# base.py
from celery import Task

class TrackedTask(Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # fires once, after the final retry has been exhausted
        metrics.increment("celery.task.failed", tags={"task": self.name})
        logger.error(
            "task permanently failed",
            extra={"task_id": task_id, "task": self.name, "args": args, "exc": repr(exc)},
        )
        super().on_failure(exc, task_id, args, kwargs, einfo)

    def on_retry(self, exc, task_id, args, kwargs, einfo):
        metrics.increment("celery.task.retried", tags={"task": self.name})
        super().on_retry(exc, task_id, args, kwargs, einfo)

Wire the base class into the decorator so every tracked task inherits the hooks:

# tasks.py
@shared_task(base=TrackedTask, bind=True, autoretry_for=(ConnectionError,), max_retries=5)
def deliver_webhook(self, endpoint, payload):
    http_client.post(endpoint, json=payload, timeout=15)

Step 6 — Route exhausted tasks to a dead-letter queue

Tasks that exhaust their retries should not vanish — they belong in a quarantine queue for inspection and replay. The cleanest approach with RabbitMQ is a broker-level dead-letter exchange, but you can also forward to a dedicated queue from the on_failure hook, which works with any broker including Redis. The concepts here are covered in depth under Dead-letter queues & poison messages.

First, declare a quarantine queue with RabbitMQ's dead-letter routing so messages that are rejected (not requeued) flow to it automatically:

# celeryconfig.py
from kombu import Queue, Exchange

task_queues = (
    Queue(
        "default",
        Exchange("default"),
        routing_key="default",
        queue_arguments={
            "x-dead-letter-exchange": "dlx",          # rejected messages go here
            "x-dead-letter-routing-key": "dead",
        },
    ),
    Queue("dead_letter", Exchange("dlx"), routing_key="dead"),
)

For an application-level path that is broker-agnostic, forward the failed payload from the failure hook to an explicit dead-letter task:

# base.py
class DLQTask(Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        quarantine_task.apply_async(
            kwargs={
                "original_task": self.name,
                "args": args,
                "kwargs": kwargs,
                "error": repr(exc),
                "traceback": str(einfo),
            },
            queue="dead_letter",
        )
        super().on_failure(exc, task_id, args, kwargs, einfo)

@shared_task(queue="dead_letter")
def quarantine_task(**failed):
    # persist for manual inspection / scheduled replay
    DeadLetter.objects.create(**failed)

Run a dedicated worker that consumes only dead_letter so quarantined work never competes with live traffic for capacity:

# start a low-concurrency worker drained off the dead-letter queue
celery -A myapp worker --queues=dead_letter --concurrency=1 --loglevel=INFO

Verification

Confirm the policy behaves as intended before trusting it in production.

Inspect the computed retry schedule and exception filter at runtime:

# python shell
from myapp.tasks import sync_to_crm
print(sync_to_crm.max_retries)            # -> 8
print(sync_to_crm.retry_backoff)          # -> 2
print(sync_to_crm.retry_jitter)           # -> True
print(sync_to_crm.autoretry_for)          # -> (ConnectionError, Timeout)

Force a transient failure and watch the eventing stream to see retries fire and the message land in quarantine:

# stream task events; look for task-retried then task-failed, then a dead_letter enqueue
celery -A myapp events --dump

Assert the dead-letter path in an integration test:

# tests/test_retry.py
def test_exhausted_task_is_quarantined(monkeypatch):
    monkeypatch.setattr("myapp.tasks.crm.upsert", _always_raises(ConnectionError))
    sync_to_crm.apply(args=[42])  # runs eagerly through all retries
    assert DeadLetter.objects.filter(original_task="myapp.tasks.sync_to_crm").exists()

Gotchas & edge cases

  • autoretry_for catches subclasses. Listing Exception will retry everything, including programming bugs and poison payloads, masking permanent failures as transient ones. List the narrowest exception classes that are actually recoverable.
  • max_retries=None means infinite. A poison message with no retry cap and no dead-letter exit will loop forever, consuming a worker slot indefinitely. Always set a finite max_retries and a dead-letter path together.
  • acks_late plus a long visibility_timeout masks crashes. With Redis as broker, a dead worker's message isn't redelivered until the visibility timeout elapses. Keep the timeout aligned with your p99 task duration or recovery will feel stuck.
  • on_failure runs only after the final retry. It does not fire on intermediate retries — use on_retry for per-attempt instrumentation. Counting failures in on_retry will over-report.
  • retry_backoff is ignored when you call self.retry(countdown=...). An explicit countdown wins, so don't expect the exponential schedule if your code overrides it per call.

Related