Resilience & Failure Handling

Failures are expected at scale. The system classifies each failure and handles it according to its category, as described below.

Chunk-Level Retries

If a Redis connection drops mid-chunk or the DB deadlocks:

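  • BullMQ catches the exception and retries the process-chunk job, provided the job is configured with more than one attempt (the attempts option; BullMQ does not retry by default).
  • Because ContactImportChunk saves the rawPayload, a retry can reprocess the chunk from its stored input. The worker keeps no state between attempts, so rerunning a chunk is safe as long as its writes are idempotent.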
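To illustrate why a stored rawPayload makes retries safe, here is a minimal sketch in plain TypeScript. It does not use BullMQ itself: runWithRetries only imitates a queue re-invoking a failed job, and the ResultStore map, the ParsedRow shape, and the (chunkId, rowIndex) key are assumptions made for illustration. Only ContactImportChunk and rawPayload come from the text above.

```typescript
// Hypothetical shapes; only the names ContactImportChunk and rawPayload
// appear in the original design.
interface ContactImportChunk {
  id: string;
  rawPayload: string[]; // raw rows exactly as stored when the chunk was created
}

interface ParsedRow {
  chunkId: string;
  rowIndex: number;
  value: string;
}

// Stand-in for a results table keyed by (chunkId, rowIndex).
type ResultStore = Map<string, ParsedRow>;

// Processes a chunk using only its stored rawPayload. Every write is a keyed
// upsert, so a rerun overwrites earlier rows instead of duplicating them.
function processChunk(chunk: ContactImportChunk, store: ResultStore): void {
  chunk.rawPayload.forEach((raw, rowIndex) => {
    const key = chunk.id + ":" + rowIndex;
    store.set(key, { chunkId: chunk.id, rowIndex, value: raw.trim() });
  });
}

// Imitates a queue retrying a failed job: calls the task until it succeeds
// or maxAttempts is exhausted, and returns the attempt that succeeded.
function runWithRetries(
  task: (attempt: number) => void,
  maxAttempts: number
): number {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      task(attempt);
      return attempt;
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError ?? new Error("maxAttempts must be at least 1");
}
```

If the first attempt fails after writing only some rows, the second attempt rewrites the same keys, so the store ends up with exactly one entry per row of the chunk.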

Partial Success (Non-Fatal Errors)

If a user uploads 100,000 rows and 1,000 of them contain invalid phone numbers:

  • The parser logs 1,000 Validation Errors.
  • The 99,000 valid rows are kept.
  • The overall Job Status becomes PENDING_REVIEW (or PARTIAL_SUCCESS).
  • The user can download the 1,000 error rows, fix them, and upload just those rows again.
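The partial-success flow above can be sketched as a function that separates valid rows from error rows and derives the job status. This is an illustration under stated assumptions: the phone rule is deliberately simplistic, the COMPLETED and FAILED status names are hypothetical, and importRows and RowError are invented for the example. Only PARTIAL_SUCCESS comes from the text.

```typescript
// PARTIAL_SUCCESS is named in the design; COMPLETED and FAILED are assumed.
type JobStatus = "COMPLETED" | "PARTIAL_SUCCESS" | "FAILED";

interface RowError {
  rowNumber: number; // 1-based position in the upload
  raw: string;       // original value, kept so the user can fix and re-upload it
  reason: string;
}

interface ImportResult {
  valid: string[];
  errors: RowError[];
  status: JobStatus;
}

// Hypothetical rule: after removing spaces and punctuation, an optional "+"
// followed by 8 to 15 digits. A real validator would be stricter.
function isPlausiblePhone(raw: string): boolean {
  return /^\+?[0-9]{8,15}$/.test(raw.replace(/[\s().-]/g, ""));
}

// Keeps every valid row, records one error per invalid row, and never aborts
// the whole import because of individual bad rows.
function importRows(rows: string[]): ImportResult {
  const valid: string[] = [];
  const errors: RowError[] = [];
  rows.forEach((raw, i) => {
    if (isPlausiblePhone(raw)) {
      valid.push(raw);
    } else {
      errors.push({ rowNumber: i + 1, raw, reason: "invalid phone number" });
    }
  });
  let status: JobStatus = "PARTIAL_SUCCESS";
  if (errors.length === 0) status = "COMPLETED";
  else if (valid.length === 0) status = "FAILED";
  return { valid, errors, status };
}
```

Because each error keeps the original row number and raw value, the errors array holds everything needed to produce the downloadable file of rows to fix.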

Worker Crash Recovery

If an OOM (Out of Memory) kills a chunk worker:

  • BullMQ's stalled-job detection notices that the job's lock was not renewed and concludes that the worker died.
  • The job is moved back to the waiting queue and picked up by a healthy worker.
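The recovery mechanism can be pictured with a small model. This is a simplified sketch of the idea, not BullMQ's actual implementation: the Job shape, the recoverStalled function, and the numeric timestamps are assumptions for illustration. The maxStalledCount parameter mirrors the BullMQ worker option of the same name, which caps how many times a job may be recovered before it is failed.

```typescript
type JobState = "waiting" | "active" | "failed";

// Hypothetical, simplified job record.
interface Job {
  id: string;
  state: JobState;
  lockExpiresAt: number; // a live worker keeps pushing this forward
  stalledCount: number;  // how many times this job has stalled so far
}

// Scans active jobs whose lock has expired (their worker stopped renewing it,
// for example after an OOM kill). Such jobs go back to "waiting" so another
// worker can pick them up, unless they have stalled too often, in which case
// they are marked "failed". Returns the ids of the jobs that were requeued.
function recoverStalled(
  jobs: Job[],
  now: number,
  maxStalledCount: number
): string[] {
  const requeued: string[] = [];
  for (const job of jobs) {
    if (job.state !== "active" || job.lockExpiresAt > now) continue;
    job.stalledCount += 1;
    if (job.stalledCount > maxStalledCount) {
      job.state = "failed";
    } else {
      job.state = "waiting";
      requeued.push(job.id);
    }
  }
  return requeued;
}
```

The cap matters for chunks that reliably exhaust memory: without it, a chunk that kills every worker that touches it would be requeued indefinitely.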

DB Row Locking

During the "Confirm" step (where valid contacts are moved to the actual DB):

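  • Deduplication relies on INSERT ... ON CONFLICT DO NOTHING, which needs a unique constraint on the deduplication key to detect conflicts.
  • This avoids SELECT ... FOR UPDATE locking, and the deadlocks it can cause across 100,000 rows. Provided the unique key includes the tenant, deduplication stays scoped to each tenant.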
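As an illustration of the statement described above, the following sketch builds a parameterized multi-row insert. The contacts table, its tenant_id, phone, and name columns, and the unique constraint on (tenant_id, phone) are all assumptions; the actual schema is not shown in this document. The function only builds the SQL text and parameter list, and executing it is left to whichever database client is in use.

```typescript
// Hypothetical contact shape and schema; adjust to the real tables.
interface Contact {
  tenantId: string;
  phone: string;
  name: string;
}

interface SqlStatement {
  text: string;     // SQL with $1, $2, ... placeholders
  values: string[]; // parameters in placeholder order
}

// Builds one INSERT for a batch of contacts. Rows that collide with an
// existing (tenant_id, phone) pair are skipped by the database itself, so no
// prior SELECT ... FOR UPDATE is needed to check for duplicates.
function buildDedupInsert(contacts: Contact[]): SqlStatement {
  if (contacts.length === 0) {
    throw new Error("cannot build an INSERT for an empty batch");
  }
  const values: string[] = [];
  const tuples = contacts.map((c) => {
    values.push(c.tenantId, c.phone, c.name);
    const n = values.length; // index of the last placeholder in this tuple
    return "($" + (n - 2) + ", $" + (n - 1) + ", $" + n + ")";
  });
  const text =
    "INSERT INTO contacts (tenant_id, phone, name) VALUES " +
    tuples.join(", ") +
    " ON CONFLICT (tenant_id, phone) DO NOTHING";
  return { text, values };
}
```

For an import of 100,000 rows, the Confirm step would likely call this once per batch rather than once for the whole upload, since a single statement can only carry a limited number of parameters.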