BullMQ Architecture
Queue Design​
To prevent "noisy neighbor" problems and ensure horizontal scalability, the system relies on distinct queues.
upload_jobs: Handles theprocess-uploadjobs. A worker streams the raw file and breaks it down into chunks.upload_chunks: Handles theprocess-chunkjobs. High concurrency (e.g., 50-100 workers) parses and validates the 1000-row chunks.import_persistence: Handles the final persistence of verified contacts to PostgreSQL.
Worker Lifecycle​
Stalled Job Recovery​
BullMQ naturally tracks active locks.
- Worker Heartbeat: Workers continually extend lock TTLs while parsing large chunks.
- Abandoned Lock Recovery: If a worker node crashes (e.g., OOM), BullMQ detects the stalled job via Redis and automatically moves it back to the
waitqueue for retry ownership by a healthy worker.
Queue Backpressure & Load Shedding​
To prevent queue collapse and Redis memory explosion:
- Queue Depth Alerts: Prometheus alerts trigger if
waitingjobs exceed 50,000. - Upload Throttling: API Gateways reject new uploads (HTTP 429) if the system detects excessive queue lag.
- Graceful Degradation: High-priority webhook queues and transactional delivery queues are entirely isolated from the heavy contact-import queues to ensure messaging latency is never impacted by a massive 1M-row contact import.
Dead Letter Queues & Retries​
- Each queue has
attempts: 3with exponential backoff. - Jobs failing all retries are moved to a
DeadLetterQueuerow in PostgreSQL (not just Redis) for persistent tracking and manual inspection via the admin portal.
Outbox Pattern (DB → Redis Relay)​
The OUTBOX service mode (SERVICE_TYPE=OUTBOX) implements the Transactional Outbox Pattern to guarantee at-least-once message delivery with built-in multi-tenant isolation:
Key properties:
FOR UPDATE SKIP LOCKEDprevents multiple Outbox relay instances from processing the same event.- The
OutboxEventtable index[status, createdAt]and[status, vendorId, createdAt]are tuned for high-performance fair-share queries. - A PostgreSQL window function (CTE Partition) enforces that no single vendor's massive outbox stream can cause Head-of-Line (HoL) blocking for others. It pulls at most 50 events per vendor per cycle, up to a global batch limit of 500.
- On Outbox relay crash, PENDING events remain in the DB and are picked up on restart — no data loss.
Multi-Tenant Virtualization & Isolation​
To support simultaneous heavy campaign launches by multiple vendors without starvation or resource starvation, we virtualization-isolate resource usage across tenants:
1. Fair-Share Campaign Scheduling​
The Campaigns Service poller runs every minute to find scheduled campaigns ready to run. Instead of executing campaigns sequentially (which would allow a single vendor's huge scheduled blast to monopolize execution threads and starve others), it implements a fair round-robin selection:
- It fetches up to 50 ready campaigns from the database (
QUEUEDandscheduledAt <= now). - In memory, it groups the fetched campaigns by
vendorId. - It executes a round-robin selection across the groups up to a global cycle limit of 10 campaigns, capping at most 2 campaigns per vendor per scheduler cycle.
- This ensures time-sensitive scheduled campaigns from smaller vendors are executed promptly even under heavy system load.
2. Fair-Share Outbox Polling​
The Outbox Relay uses a Window Partition CTE query to select pending events:
- It partitions pending events by
vendorId(treating system/null vendor events individually usingCOALESCE("vendorId", id)). - It takes at most the first 50 events per vendor, capping the final batch at 500 events.
- This guarantees that a vendor queuing 100,000 messages will not prevent other vendors' transactional messages from being polled and queued in BullMQ.
3. Redis-Based Worker Rate Limiting (Tenant Virtualization)​
Since native BullMQ group-based rate limiting requires BullMQ Pro, we enforce tenant-level rate limits in open-source BullMQ using a custom Redis-based sliding/fixed-window token bucket:
- Limits: Shared standard workers cap execution at 10 requests/sec per vendor, while enterprise workers allow up to 50 requests/sec per vendor.
- Mechanism: At the beginning of the
process()method, the worker increments a Redis key specific to the vendor and current second:ratelimit:vendor:${vendorId}:${currentSecond}with a 2-second TTL. - Fair Yield: If the count exceeds the vendor's limit, the worker decrements the key (to maintain count accuracy) and reschedules the job 1 second in the future using
await job.moveToDelayed(Date.now() + 1000, job.token). - Efficiency: Rescheduling immediately returns control back to the worker thread, allowing it to pull and process eligible jobs from other vendors rather than blocking.
Message Dispatch Queue​
Beyond contact import queues, BullMQ also powers the message dispatch pipeline:
| Queue | Purpose | Concurrency |
|---|---|---|
upload_jobs | File streaming + chunking | Low (I/O bound) |
upload_chunks | Parse + validate + normalize rows | High (50–100 concurrent) |
import_persistence | Final DB upsert of valid contacts | Medium |
message_dispatch | Send individual messages via Meta API | Medium (rate-limited) |
message_retry | Retry failed messages with backoff | Low |