Resilience & Failure Handling
Failures are expected at scale. The system sorts them into the categories below and handles each one explicitly.
Chunk-Level Retries
If a Redis connection drops mid-chunk or the DB deadlocks:
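- BullMQ catches the exception and retries the process-chunk job, up to the job's configured number of attempts.
- Because ContactImportChunk saves the rawPayload, a retry can reprocess the chunk from the stored input. The worker keeps no state between attempts, so retries are idempotent.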
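A minimal sketch of why persisting the raw payload makes retries safe. Only ContactImportChunk and rawPayload come from the text above. The in-memory StagingStore, the failAtRow test hook, and the runWithRetries loop are hypothetical stand-ins for the staging table and for the queue's retry behaviour, not the system's actual code.

```typescript
// Hypothetical shape: only the names ContactImportChunk and rawPayload are from the doc.
interface ContactImportChunk {
  id: string;
  rawPayload: string[]; // one raw phone number per row, as persisted at upload time
}

// In-memory stand-in for the staging table, keyed by "chunkId:rowIndex".
type StagingStore = Map<string, string>;

// Reads everything from the persisted chunk, so a retry starts from scratch.
// Writing under a deterministic key means a repeated row overwrites itself
// instead of creating a duplicate.
function processChunk(
  chunk: ContactImportChunk,
  store: StagingStore,
  failAtRow?: number, // hypothetical hook that simulates a dropped connection
): void {
  chunk.rawPayload.forEach((raw, rowIndex) => {
    if (rowIndex === failAtRow) {
      throw new Error("simulated connection drop");
    }
    store.set(`${chunk.id}:${rowIndex}`, raw.trim());
  });
}

// Simplified retry loop standing in for the queue's retry behaviour.
// Returns the attempt number that succeeded.
function runWithRetries(
  chunk: ContactImportChunk,
  store: StagingStore,
  maxAttempts: number,
): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Fail only on the first attempt, after some rows were already written.
      processChunk(chunk, store, attempt === 1 ? 2 : undefined);
      return attempt;
    } catch {
      // Swallow and retry; a real queue would also apply a backoff delay.
    }
  }
  throw new Error("chunk failed after all attempts");
}
```

The first attempt writes two rows and then fails. The second attempt rewrites those same two keys and continues, so the store ends with exactly one entry per row. In BullMQ itself, retries are enabled through the attempts and backoff job options.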
Partial Success (Non-Fatal Errors)
If a user uploads 100,000 rows and 1,000 of them contain invalid phone numbers:
- The parser logs 1,000 Validation Errors.
- The 99,000 valid rows are kept.
- The overall Job Status becomes PENDING_REVIEW (or PARTIAL_SUCCESS).
- The user can download the 1,000 error rows, fix them, and upload just those rows again.
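The partial-success flow above can be sketched as a single pass that splits rows into valid contacts and per-row errors, then derives the job status from the two counts. The PHONE_PATTERN rule (a rough E.164-style check), the RowError shape, and the COMPLETED and FAILED statuses are assumptions made for illustration. Only PARTIAL_SUCCESS is named in the text.

```typescript
// COMPLETED and FAILED are hypothetical; PARTIAL_SUCCESS is the status named in the doc.
type JobStatus = "COMPLETED" | "PARTIAL_SUCCESS" | "FAILED";

interface RowError {
  rowIndex: number; // position in the uploaded file, so the user can locate the row
  raw: string;      // original value, returned unchanged in the error download
  reason: string;
}

interface ImportResult {
  valid: string[];
  errors: RowError[];
  status: JobStatus;
}

// Assumed validation rule: "+", a non-zero digit, then 7 to 14 more digits.
const PHONE_PATTERN = /^\+[1-9]\d{7,14}$/;

function importRows(rows: string[]): ImportResult {
  const valid: string[] = [];
  const errors: RowError[] = [];

  rows.forEach((raw, rowIndex) => {
    const phone = raw.trim();
    if (PHONE_PATTERN.test(phone)) {
      valid.push(phone);
    } else {
      // A bad row is recorded, not thrown: it must not abort the other rows.
      errors.push({ rowIndex, raw, reason: "invalid phone number" });
    }
  });

  let status: JobStatus;
  if (errors.length === 0) {
    status = "COMPLETED";
  } else if (valid.length === 0) {
    status = "FAILED";
  } else {
    status = "PARTIAL_SUCCESS";
  }
  return { valid, errors, status };
}
```

Keeping rowIndex and the untouched raw value on each error is what makes the "download, fix, re-upload" loop practical: the error file can be generated directly from the errors array.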
Worker Crash Recovery
If an OOM (Out of Memory) kills a chunk worker:
- BullMQ's stalled job detection notices that the dead worker has stopped renewing its lock on the job.
- The job is moved back to the waiting state and picked up by a healthy worker.
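To illustrate the idea behind stalled-job recovery, here is a simplified model of a lock-based stall check. It is not BullMQ's implementation. The TrackedJob shape and the recoverStalledJobs function are hypothetical. The maxStalledCount parameter mirrors the BullMQ worker option of the same name, which caps how many times a job may stall before it is failed instead of retried.

```typescript
type JobState = "wait" | "active" | "failed";

// Hypothetical bookkeeping record for one job.
interface TrackedJob {
  id: string;
  state: JobState;
  lockExpiresAt: number; // epoch ms; a live worker keeps pushing this forward
  stalledCount: number;  // how many times this job has already stalled
}

// Scans active jobs and recovers those whose lock has expired.
// Returns the ids of jobs that were put back in the queue.
function recoverStalledJobs(
  jobs: TrackedJob[],
  now: number,
  maxStalledCount: number,
): string[] {
  const recovered: string[] = [];
  for (const job of jobs) {
    // Only active jobs with an expired lock count as stalled.
    if (job.state !== "active" || job.lockExpiresAt > now) {
      continue;
    }
    job.stalledCount += 1;
    if (job.stalledCount > maxStalledCount) {
      // Stalled too often: likely a poison job that keeps killing workers.
      job.state = "failed";
    } else {
      job.state = "wait";
      recovered.push(job.id);
    }
  }
  return recovered;
}
```

The cap matters for OOM crashes in particular: a chunk that reliably exhausts memory would otherwise be handed to one healthy worker after another.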
DB Row Locking
During the "Confirm" step (where valid contacts are moved to the actual DB):
- Deduplication relies on INSERT ... ON CONFLICT DO NOTHING.
- This avoids complex SELECT FOR UPDATE deadlocks on 100,000 rows while ensuring multi-tenant consistency.
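A sketch of how such a statement might be built for a batch of contacts, assuming PostgreSQL-style numbered placeholders. The contacts table, its tenant_id, phone, and name columns, and the unique index on (tenant_id, phone) are all hypothetical. The source only names the INSERT ... ON CONFLICT DO NOTHING pattern.

```typescript
// Hypothetical row shape for the "Confirm" step.
interface ContactRow {
  tenantId: string;
  phone: string;
  name: string;
}

// Parameterized query in the { text, values } shape that common PostgreSQL clients accept.
interface SqlQuery {
  text: string;
  values: string[];
}

// Builds one multi-row INSERT. Rows that collide with an existing
// (tenant_id, phone) pair are skipped by the database, so the
// application never needs to lock rows and check for duplicates itself.
function buildDedupInsert(rows: ContactRow[]): SqlQuery {
  if (rows.length === 0) {
    throw new Error("nothing to insert");
  }
  const values: string[] = [];
  const tuples = rows.map((row, i) => {
    values.push(row.tenantId, row.phone, row.name);
    const base = i * 3; // three placeholders per row
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });
  const text =
    "INSERT INTO contacts (tenant_id, phone, name) VALUES " +
    tuples.join(", ") +
    " ON CONFLICT (tenant_id, phone) DO NOTHING";
  return { text, values };
}
```

Including tenant_id in the conflict target is what scopes deduplication to a single tenant in this sketch: the same phone number can exist under two tenants without either insert being dropped. For 100,000 rows the batch would need to be split, since PostgreSQL limits the number of bind parameters per statement.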