Skip to main content

BullMQ Architecture

Queue Design​

To prevent "noisy neighbor" problems and ensure horizontal scalability, the system relies on distinct queues.

  • upload_jobs: Handles the process-upload jobs. A worker streams the raw file and breaks it down into chunks.
  • upload_chunks: Handles the process-chunk jobs. High concurrency (e.g., 50-100 workers) parses and validates the 1000-row chunks.
  • import_persistence: Handles the final persistence of verified contacts to PostgreSQL.

Worker Lifecycle​

Stalled Job Recovery​

BullMQ naturally tracks active locks.

  • Worker Heartbeat: Workers continually extend lock TTLs while parsing large chunks.
  • Abandoned Lock Recovery: If a worker node crashes (e.g., OOM), BullMQ detects the stalled job via Redis and automatically moves it back to the wait queue for retry ownership by a healthy worker.

Queue Backpressure & Load Shedding​

To prevent queue collapse and Redis memory explosion:

  • Queue Depth Alerts: Prometheus alerts trigger if waiting jobs exceed 50,000.
  • Upload Throttling: API Gateways reject new uploads (HTTP 429) if the system detects excessive queue lag.
  • Graceful Degradation: High-priority webhook queues and transactional delivery queues are entirely isolated from the heavy contact-import queues to ensure messaging latency is never impacted by a massive 1M-row contact import.

Dead Letter Queues & Retries​

  • Each queue has attempts: 3 with exponential backoff.
  • Jobs failing all retries are moved to a DeadLetterQueue row in PostgreSQL (not just Redis) for persistent tracking and manual inspection via the admin portal.

Outbox Pattern (DB → Redis Relay)​

The OUTBOX service mode (SERVICE_TYPE=OUTBOX) implements the Transactional Outbox Pattern to guarantee at-least-once message delivery with built-in multi-tenant isolation:

Key properties:

  • FOR UPDATE SKIP LOCKED prevents multiple Outbox relay instances from processing the same event.
  • The OutboxEvent table index [status, createdAt] and [status, vendorId, createdAt] are tuned for high-performance fair-share queries.
  • A PostgreSQL window function (CTE Partition) enforces that no single vendor's massive outbox stream can cause Head-of-Line (HoL) blocking for others. It pulls at most 50 events per vendor per cycle, up to a global batch limit of 500.
  • On Outbox relay crash, PENDING events remain in the DB and are picked up on restart — no data loss.

Multi-Tenant Virtualization & Isolation​

To support simultaneous heavy campaign launches by multiple vendors without starvation or resource starvation, we virtualization-isolate resource usage across tenants:

1. Fair-Share Campaign Scheduling​

The Campaigns Service poller runs every minute to find scheduled campaigns ready to run. Instead of executing campaigns sequentially (which would allow a single vendor's huge scheduled blast to monopolize execution threads and starve others), it implements a fair round-robin selection:

  • It fetches up to 50 ready campaigns from the database (QUEUED and scheduledAt <= now).
  • In memory, it groups the fetched campaigns by vendorId.
  • It executes a round-robin selection across the groups up to a global cycle limit of 10 campaigns, capping at most 2 campaigns per vendor per scheduler cycle.
  • This ensures time-sensitive scheduled campaigns from smaller vendors are executed promptly even under heavy system load.

2. Fair-Share Outbox Polling​

The Outbox Relay uses a Window Partition CTE query to select pending events:

  • It partitions pending events by vendorId (treating system/null vendor events individually using COALESCE("vendorId", id)).
  • It takes at most the first 50 events per vendor, capping the final batch at 500 events.
  • This guarantees that a vendor queuing 100,000 messages will not prevent other vendors' transactional messages from being polled and queued in BullMQ.

3. Redis-Based Worker Rate Limiting (Tenant Virtualization)​

Since native BullMQ group-based rate limiting requires BullMQ Pro, we enforce tenant-level rate limits in open-source BullMQ using a custom Redis-based sliding/fixed-window token bucket:

  • Limits: Shared standard workers cap execution at 10 requests/sec per vendor, while enterprise workers allow up to 50 requests/sec per vendor.
  • Mechanism: At the beginning of the process() method, the worker increments a Redis key specific to the vendor and current second: ratelimit:vendor:${vendorId}:${currentSecond} with a 2-second TTL.
  • Fair Yield: If the count exceeds the vendor's limit, the worker decrements the key (to maintain count accuracy) and reschedules the job 1 second in the future using await job.moveToDelayed(Date.now() + 1000, job.token).
  • Efficiency: Rescheduling immediately returns control back to the worker thread, allowing it to pull and process eligible jobs from other vendors rather than blocking.

Message Dispatch Queue​

Beyond contact import queues, BullMQ also powers the message dispatch pipeline:

QueuePurposeConcurrency
upload_jobsFile streaming + chunkingLow (I/O bound)
upload_chunksParse + validate + normalize rowsHigh (50–100 concurrent)
import_persistenceFinal DB upsert of valid contactsMedium
message_dispatchSend individual messages via Meta APIMedium (rate-limited)
message_retryRetry failed messages with backoffLow