Building a BullMQ Grafana dashboard

This guide belongs to the Grafana dashboards for queues series, within the broader topic of observability for job queues. It walks through getting BullMQ metrics into Grafana from scratch.

BullMQ exposes rich queue state — waiting, active, completed, failed, delayed — through its Redis-backed API, but none of it reaches Grafana without an explicit export step. The result is a familiar blind spot: you can see jobs in redis-cli but you have no time-series of queue depth, no completed-versus-failed trend, and no duration histogram to spot a slow regression. This page builds the full path: a Prometheus exporter from your BullMQ app, a scrape config, a Grafana dashboard with the panels that actually matter, and an alert on failure rate.

Prerequisites

  • A Node.js service running BullMQ 4.x+ against Redis. If you are still configuring it, see configuring BullMQ concurrency limits for high throughput.
  • A running Prometheus (v2.x) and Grafana (v9+).
  • prom-client available to your Node app (npm install prom-client).
  • The queue name(s) you want to chart and network access from Prometheus to your app's metrics port.
  • Grafana admin rights to import a dashboard and create an alert rule.

Step 1 — Export BullMQ metrics from the Node app

BullMQ's Queue.getJobCounts() returns the live counts you need; prom-client turns them into Prometheus gauges. Expose a /metrics route and poll the queue on a short interval.

// metrics.js — expose BullMQ counts and a duration histogram to Prometheus
const express = require("express");
const client = require("prom-client");
const { Queue, QueueEvents } = require("bullmq");

const connection = { host: "redis", port: 6379 };
const queueName = "transcode";
const queue = new Queue(queueName, { connection });
const events = new QueueEvents(queueName, { connection });

const register = new client.Registry();
client.collectDefaultMetrics({ register });

const depth = new client.Gauge({
  name: "bullmq_queue_jobs",
  help: "Job counts by state",
  labelNames: ["queue", "state"],
  registers: [register],
});
const completed = new client.Counter({
  name: "bullmq_jobs_completed_total",
  help: "Completed jobs", labelNames: ["queue"], registers: [register],
});
const failed = new client.Counter({
  name: "bullmq_jobs_failed_total",
  help: "Failed jobs", labelNames: ["queue"], registers: [register],
});
const duration = new client.Histogram({
  name: "bullmq_job_duration_seconds",
  help: "Job processing time",
  labelNames: ["queue"],
  buckets: [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120],
  registers: [register],
});

// Poll queue counts every 5s -> gauges
async function refreshCounts() {
  const counts = await queue.getJobCounts(
    "waiting", "active", "completed", "failed", "delayed"
  );
  for (const [state, value] of Object.entries(counts)) {
    depth.labels(queueName, state).set(value);
  }
}
setInterval(() => refreshCounts().catch(console.error), 5000);

// Event-driven counters + duration
events.on("completed", ({ jobId }) => completed.labels(queueName).inc());
events.on("failed", ({ jobId }) => failed.labels(queueName).inc());
// Record duration in the worker process, from a worker.on("completed") listener;
// finishedOn is only set once the job has finished:
//   duration.labels(queueName).observe((job.finishedOn - job.processedOn) / 1000);

const app = express();
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
app.listen(9700, () => console.log("metrics on :9700"));

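Record the histogram in the worker process, where you have the job object: in a worker.on("completed") listener, observe (job.finishedOn - job.processedOn) / 1000. Both timestamps are in milliseconds, and finishedOn is unset while the processor function is still running. If the worker runs in a different process from metrics.js, that process needs its own registry and /metrics endpoint for the histogram to be scraped. The gauge poll handles the depth counts.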
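
The duration calculation can be sketched as a small helper, assuming the listener receives a BullMQ job object whose processedOn and finishedOn are millisecond timestamps. The name jobDurationSeconds is hypothetical, and the guards for missing or inverted timestamps are an added precaution rather than something BullMQ requires.

```javascript
// Convert a finished job's timestamps (milliseconds) into seconds for the histogram.
// Returns null when either timestamp is missing, so the caller can skip the observation.
function jobDurationSeconds(job) {
  const { processedOn, finishedOn } = job;
  if (!Number.isFinite(processedOn) || !Number.isFinite(finishedOn)) return null;
  // Clamp at 0 so an inverted pair of timestamps never produces a negative duration
  return Math.max(0, (finishedOn - processedOn) / 1000);
}

// Usage inside the worker process (worker, duration and queueName come from your app):
//   worker.on("completed", (job) => {
//     const seconds = jobDurationSeconds(job);
//     if (seconds !== null) duration.labels(queueName).observe(seconds);
//   });

console.log(jobDurationSeconds({ processedOn: 1000, finishedOn: 3500 })); // 2.5
```

Returning null instead of 0 for incomplete jobs keeps bogus zero-second samples out of the lowest bucket.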

Step 2 — Scrape the exporter with Prometheus

Add the Node service as a target. With a 15s scrape interval, Prometheus samples roughly every third poll result, which is frequent enough to see depth changes promptly.

# prometheus.yml
scrape_configs:
  - job_name: "bullmq"
    scrape_interval: 15s
    static_configs:
      - targets: ["transcode-worker:9700"]
        labels:
          service: "media-pipeline"

# Shell: apply the change without a restart
# (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload

Step 3 — Validate the queries Grafana will use

Build and test each panel query in the Prometheus expression browser first. A panel that shows "No data" is almost always a bad query, not a bad dashboard.

# Queue depth by state (stacked time series)
bullmq_queue_jobs{queue="transcode"}

# Completed vs failed throughput, jobs/sec
rate(bullmq_jobs_completed_total{queue="transcode"}[5m])
rate(bullmq_jobs_failed_total{queue="transcode"}[5m])

# p95 job duration from the histogram
histogram_quantile(0.95,
  sum by (le) (rate(bullmq_job_duration_seconds_bucket{queue="transcode"}[5m])))

# Failure ratio for the stat/alert panel
rate(bullmq_jobs_failed_total{queue="transcode"}[5m])
  / clamp_min(rate(bullmq_jobs_completed_total{queue="transcode"}[5m])
            + rate(bullmq_jobs_failed_total{queue="transcode"}[5m]), 0.001)
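
To see what clamp_min contributes to the failure-ratio query, here is the same arithmetic as a minimal JavaScript sketch. The function names are illustrative, and the inputs stand in for the per-second rates Prometheus would compute.

```javascript
// Mirror of the PromQL expression: failed / clamp_min(completed + failed, 0.001)
function clampMin(value, min) {
  return Math.max(value, min);
}

function failureRatio(failedPerSec, completedPerSec) {
  return failedPerSec / clampMin(completedPerSec + failedPerSec, 0.001);
}

console.log(failureRatio(0.5, 9.5)); // 0.05, right at a 5% threshold
console.log(failureRatio(0, 0));     // 0, an idle queue yields 0 instead of 0/0 = NaN
```

The trade-off is that when total throughput falls below 0.001 jobs/s, the clamp makes the denominator larger than the real total, so the ratio reads lower than it truly is.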

Step 4 — Import the dashboard JSON

Rather than building panels by hand, import a JSON model. This snippet defines the depth, throughput, and p95-duration panels; paste it into Grafana via Dashboards → New → Import.

{
  "title": "BullMQ — transcode",
  "templating": { "list": [
    { "name": "queue", "type": "query", "datasource": "Prometheus",
      "query": "label_values(bullmq_queue_jobs, queue)" }
  ]},
  "panels": [
    { "type": "timeseries", "title": "Queue depth by state",
      "gridPos": {"h":8,"w":12,"x":0,"y":0},
      "targets": [{ "expr": "bullmq_queue_jobs{queue=\"$queue\"}",
                    "legendFormat": "{{state}}" }] },
    { "type": "timeseries", "title": "Completed vs failed (/s)",
      "gridPos": {"h":8,"w":12,"x":12,"y":0},
      "targets": [
        { "expr": "rate(bullmq_jobs_completed_total{queue=\"$queue\"}[5m])",
          "legendFormat": "completed" },
        { "expr": "rate(bullmq_jobs_failed_total{queue=\"$queue\"}[5m])",
          "legendFormat": "failed" }] },
    { "type": "timeseries", "title": "p95 job duration (s)",
      "gridPos": {"h":8,"w":24,"x":0,"y":8},
      "targets": [{ "expr": "histogram_quantile(0.95, sum by (le) (rate(bullmq_job_duration_seconds_bucket{queue=\"$queue\"}[5m])))",
                    "legendFormat": "p95" }] }
  ],
  "schemaVersion": 39
}

The $queue template variable lets one dashboard serve every queue — pick the queue from the dropdown instead of cloning the dashboard per queue.

Step 5 — Add a failure-rate alert in Grafana

Attach a Grafana-managed alert so a rising failure ratio pages you without leaving Grafana. If you add only one alert to a job pipeline, failure ratio is a strong candidate, because it stays meaningful as throughput rises and falls.

# Grafana unified alerting rule (provisioning format)
apiVersion: 1
groups:
  - orgId: 1
    name: bullmq-alerts
    folder: queues
    interval: 1m
    rules:
      - uid: bullmq_failure_ratio
        title: "BullMQ failure ratio > 5%"
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus   # UID of your Prometheus data source
            relativeTimeRange: { from: 600, to: 0 }
            model:
              refId: A
              instant: true   # the threshold expression needs one value per series
              expr: |
                rate(bullmq_jobs_failed_total{queue="transcode"}[5m])
                  / clamp_min(rate(bullmq_jobs_completed_total{queue="transcode"}[5m])
                            + rate(bullmq_jobs_failed_total{queue="transcode"}[5m]), 0.001)
          - refId: C
            datasourceUid: __expr__
            model: { refId: C, type: threshold, expression: A,
                     conditions: [{ evaluator: { type: gt, params: [0.05] } }] }
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "transcode failure ratio above 5% for 10m"

Verification

Confirm each stage produces data before relying on the dashboard.

# 1. Exporter is serving BullMQ series
curl -s http://transcode-worker:9700/metrics | grep bullmq_

# 2. Prometheus is scraping the target
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=up{job="bullmq"}' | python3 -m json.tool   # value -> "1"

# 3. Panels render — push a few jobs, then watch depth move
#    In Grafana, the "Queue depth by state" panel should show waiting/active climb.

A healthy up{job="bullmq"}, non-empty bullmq_queue_jobs, and panels that move when you enqueue jobs confirm the dashboard is wired correctly.
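
The first check can also be scripted. The sketch below, with a hypothetical helper name and made-up sample values, scans Prometheus text exposition output and reports which expected metric names have no samples yet.

```javascript
// Report which expected metric names have no sample lines in the exposition text.
// Comment lines ("# HELP", "# TYPE") and blank lines are ignored.
function missingMetrics(exposition, expectedNames) {
  const samples = exposition
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"));
  // A sample line is the metric name followed by "{labels}" or by a space and the value
  return expectedNames.filter(
    (name) => !samples.some((l) => l.startsWith(name + "{") || l.startsWith(name + " "))
  );
}

// Illustrative /metrics output; the values are invented for the example
const exampleOutput = [
  "# HELP bullmq_queue_jobs Job counts by state",
  "# TYPE bullmq_queue_jobs gauge",
  'bullmq_queue_jobs{queue="transcode",state="waiting"} 12',
  'bullmq_jobs_completed_total{queue="transcode"} 340',
].join("\n");

console.log(
  missingMetrics(exampleOutput, [
    "bullmq_queue_jobs",
    "bullmq_jobs_completed_total",
    "bullmq_jobs_failed_total",
  ])
); // [ 'bullmq_jobs_failed_total' ]
```

A missing failed counter is not necessarily a wiring problem: a labelled counter typically exposes no sample until it has been incremented at least once, so it may only appear after the first failure.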

Gotchas and edge cases

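Counts go stale when the poller dies. bullmq_queue_jobs is set by an interval timer; if the poll keeps failing (Redis unreachable, for example), the gauge freezes at its last value and the dashboard lies. The example attaches .catch(console.error) to each refreshCounts() call, so a rejected poll is logged instead of becoming an unhandled rejection. Do not rely on time() - timestamp(bullmq_queue_jobs) for freshness: Prometheus stamps every scrape with the current time, so a frozen gauge still looks fresh. Instead, export a second gauge holding the Unix time of the last successful poll (for instance a hypothetical bullmq_queue_jobs_last_refresh_timestamp_seconds) and alert when time() minus that gauge exceeds 60.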
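
Because a frozen gauge is still scraped with a fresh sample timestamp, one option is to track the time of the last successful poll explicitly. The following is a minimal sketch: createFreshnessTracker, the lastRefresh gauge and its metric name are all hypothetical, and the clock is injectable only to keep the helper easy to test.

```javascript
// Track when the poll last succeeded so staleness can be exported as its own gauge.
// `now` defaults to Date.now and returns milliseconds.
function createFreshnessTracker(now = Date.now) {
  let lastSuccessMs = null;
  return {
    // Call after every successful refreshCounts()
    markSuccess() {
      lastSuccessMs = now();
    },
    // Unix seconds of the last success, or 0 if the poll has never succeeded
    lastSuccessSeconds() {
      return lastSuccessMs === null ? 0 : lastSuccessMs / 1000;
    },
    // True when no poll has succeeded within maxAgeMs
    isStale(maxAgeMs) {
      return lastSuccessMs === null || now() - lastSuccessMs > maxAgeMs;
    },
  };
}

// Usage with the poller (lastRefresh is a hypothetical prom-client Gauge named
// bullmq_queue_jobs_last_refresh_timestamp_seconds):
//   const tracker = createFreshnessTracker();
//   async function refreshCounts() {
//     /* ...set the depth gauges... */
//     tracker.markSuccess();
//     lastRefresh.set(tracker.lastSuccessSeconds());
//   }
// Alert expression: time() - bullmq_queue_jobs_last_refresh_timestamp_seconds > 60
```

Reporting 0 before the first success means the alert expression also fires when the poller never worked at all.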

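Histograms with le aggregated away. histogram_quantile needs the le label preserved. If you sum without (le), or sum by a label set that omits le, the function has no buckets to interpolate between and the panel comes back empty or NaN. Always aggregate with sum by (le), plus any labels you want to keep, before histogram_quantile.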
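
To make the role of le concrete, here is a simplified JavaScript sketch of how a quantile is estimated from cumulative buckets: locate the bucket containing the target rank, then interpolate linearly inside it. It follows the general approach of histogram_quantile for classic histograms but is not Prometheus's implementation, and the bucket counts are invented for the example.

```javascript
// buckets: [{ le: upperBound, count: cumulativeCount }], including le = Infinity.
function histogramQuantile(q, buckets) {
  if (buckets.length < 2) return NaN; // without le buckets there is nothing to estimate
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN; // no observations in the window
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: the best available answer is the highest finite bound
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// 200 observations spread over a few of the guide's bucket bounds (made-up counts)
const exampleBuckets = [
  { le: 0.5, count: 40 },
  { le: 1, count: 120 },
  { le: 2.5, count: 180 },
  { le: 5, count: 190 },
  { le: Infinity, count: 200 },
];

console.log(histogramQuantile(0.5, exampleBuckets));  // 0.875
console.log(histogramQuantile(0.75, exampleBuckets)); // 1.75
console.log(histogramQuantile(1, exampleBuckets));    // 5
```

The last line shows a practical limit: observations above the largest finite bucket are reported as that bound, so choose buckets whose top bound exceeds your slowest expected job.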

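Several exporters for one queue multiply the series. If multiple worker processes each export the same queue's counts, Prometheus stores one series per instance: the depth panel draws a near-identical line for every exporter, and any sum() over them multiplies the real depth. The QueueEvents counters behave the same way, because every listening process counts every job. Export queue counts and event counters from a single dedicated metrics process, and let workers export only their own duration observations.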

Counter resets on deploy. bullmq_jobs_completed_total resets when the Node process restarts on deploy. Always chart it with rate(); a raw counter panel will show a misleading sawtooth at every deploy.
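
The reason rate() copes with restarts can be shown with a simplified sketch of reset handling: any decrease between consecutive samples is treated as a restart from zero. The function name and sample values are illustrative, and the sketch leaves out the window-edge extrapolation that Prometheus also applies.

```javascript
// Sum the growth of a counter across samples, treating any decrease as a
// counter reset (process restart).
function increaseWithResets(samples) {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i] - samples[i - 1];
    // After a reset the counter restarted from 0, so the new value is all growth
    total += delta >= 0 ? delta : samples[i];
  }
  return total;
}

// The counter climbs to 120, the process is redeployed, and counting resumes from 0
console.log(increaseWithResets([100, 120, 5, 15])); // 35
console.log(15 - 100); // -85, what a naive last-minus-first calculation would give
```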
