Ingesta

High-throughput data ingestion pipeline with reliable batch processing and DLQ.

February 10, 2024 · 7 min read

Problem

Ingesting high-volume event streams into a relational database is error-prone: a single bad record can fail an entire batch, and retry logic without backoff causes thundering-herd problems under load.

Solution

Built a Kafka consumer pipeline with configurable batch sizes, transactional batch inserts, dead-letter queue routing for failed records, and exponential backoff retry.
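The exponential backoff mentioned above can be sketched as a small pure function. This is a minimal illustration (the "full jitter" variant); the parameter names and defaults are assumptions, not Ingesta's actual configuration:

```python
import random


def backoff_delay_ms(attempt: int, base_ms: int = 100, cap_ms: int = 30_000) -> float:
    """Exponential backoff with full jitter.

    The raw delay doubles with each attempt (base * 2^attempt), is capped
    at cap_ms, and a uniform random fraction of it is taken so that
    concurrent retries spread out instead of stampeding together.
    """
    exp = min(cap_ms, base_ms * (2 ** attempt))
    return random.uniform(0, exp)
```

Jitter is what actually prevents the thundering herd: without it, every failed consumer would retry on the same schedule and hit the database at the same instant.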

Design

Ingesta is a Python service that consumes from Kafka topics and writes to PostgreSQL in batches.

Batch Strategy

Each consumer thread accumulates records up to a configurable BATCH_SIZE or BATCH_TIMEOUT_MS, whichever comes first. Batches are inserted in a single multi-row INSERT.
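The size-or-timeout flush condition and the multi-row INSERT can be sketched as follows. This is an illustrative reconstruction, not Ingesta's source; class and function names are hypothetical, and the placeholder style assumes a DB-API driver such as psycopg2:

```python
import time


class BatchAccumulator:
    """Buffers records and signals a flush when either BATCH_SIZE records
    are held or BATCH_TIMEOUT_MS has elapsed since the first record of
    the current batch, whichever comes first."""

    def __init__(self, batch_size: int = 500, batch_timeout_ms: int = 1000):
        self.batch_size = batch_size
        self.batch_timeout_ms = batch_timeout_ms
        self.records: list = []
        self._first_at: float | None = None

    def add(self, record) -> None:
        if not self.records:
            self._first_at = time.monotonic()
        self.records.append(record)

    def should_flush(self) -> bool:
        if len(self.records) >= self.batch_size:
            return True
        if self.records:
            elapsed_ms = (time.monotonic() - self._first_at) * 1000
            return elapsed_ms >= self.batch_timeout_ms
        return False

    def drain(self) -> list:
        batch, self.records = self.records, []
        return batch


def multirow_insert(table: str, columns: list[str], rows: list[tuple]):
    """Builds one multi-row INSERT with %s placeholders and a flat
    parameter list, so the whole batch is a single statement."""
    row_ph = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (f"INSERT INTO {table} ({', '.join(columns)}) "
           f"VALUES {', '.join([row_ph] * len(rows))}")
    params = [value for row in rows for value in row]
    return sql, params
```

One statement per batch keeps the transaction short and amortizes round-trip cost across the whole batch.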

Dead-Letter Queue

Records failing schema validation or DB insertion after MAX_RETRIES are routed to a DLQ topic with an error envelope containing the original payload, error type, and timestamp.
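The error envelope can be sketched as a small serializer. The field names here are illustrative assumptions; the source only specifies that the envelope carries the original payload, the error type, and a timestamp:

```python
import json
from datetime import datetime, timezone


def dlq_envelope(original_payload: bytes, error: Exception) -> bytes:
    """Wraps a failed record for the DLQ topic, preserving the original
    payload verbatim alongside error metadata for later triage or replay."""
    return json.dumps({
        "payload": original_payload.decode("utf-8", errors="replace"),
        "error_type": type(error).__name__,
        "error_message": str(error),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }).encode("utf-8")
```

Keeping the original payload intact in the envelope is what makes replay from the DLQ possible once the underlying fault is fixed.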

Offset Management

Kafka offsets are committed manually after successful batch insertion, ensuring at-least-once delivery.
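The commit bookkeeping can be sketched as a pure helper. Kafka's convention is that the committed offset is the *next* offset to consume, so the pipeline commits highest-seen + 1 per partition; the actual client call (e.g. a consumer's manual commit with auto-commit disabled) happens only after the batch insert succeeds. The function below is a hypothetical sketch of that calculation:

```python
def offsets_to_commit(batch: list[tuple[str, int, int]]) -> dict:
    """Given consumed records as (topic, partition, offset) tuples,
    returns the offset to commit for each (topic, partition): the
    highest offset seen plus one, i.e. the next offset to consume."""
    commits: dict[tuple[str, int], int] = {}
    for topic, partition, offset in batch:
        key = (topic, partition)
        commits[key] = max(commits.get(key, 0), offset + 1)
    return commits
```

Committing only after the insert means a crash between insert and commit redelivers the batch, hence at-least-once rather than exactly-once; downstream consumers or the insert itself must tolerate duplicates.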

Deployment

Packaged as a Docker container configured via environment variables. A docker-compose.yml is provided for local development with Kafka and PostgreSQL.
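A local-dev setup along those lines might look like the fragment below. This is an illustrative sketch, not the actual file: the image choices, service names, and all environment variable names except BATCH_SIZE, BATCH_TIMEOUT_MS, and MAX_RETRIES are assumptions.

```yaml
# Illustrative docker-compose.yml sketch; names and images are assumptions.
services:
  ingesta:
    build: .
    environment:
      KAFKA_BOOTSTRAP_SERVERS: kafka:9092          # assumed variable name
      POSTGRES_DSN: postgresql://dev:dev@postgres:5432/ingesta  # assumed
      BATCH_SIZE: "500"
      BATCH_TIMEOUT_MS: "1000"
      MAX_RETRIES: "5"
      DLQ_TOPIC: ingesta.dlq                       # assumed variable name
    depends_on: [kafka, postgres]
  kafka:
    image: bitnami/kafka:latest
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: ingesta
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
```

Driving all tuning knobs through environment variables keeps the same image usable unchanged from local compose through production.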