Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries
Date : September 14, 2021
Time : 04:00 PM - 05:00 PM

In a real-time data ingestion pipeline for analytical processing, loading data efficiently into a columnar database such as ClickHouse favors large blocks over individual rows. Applications therefore often rely on a buffering mechanism such as Kafka to store data temporarily, and on a message processing engine that aggregates Kafka messages into large blocks, which are then loaded into the backend database. Because failures can occur at various points in this pipeline, a naive block aggregator that forms blocks without additional measures would cause data duplication or data loss. We have developed a solution that avoids these issues, thereby achieving exactly-once delivery from Kafka to ClickHouse. Our solution uses Kafka’s metadata to keep track of the blocks that we intend to send to ClickHouse, and later uses this metadata to deterministically reproduce the same ClickHouse blocks for re-tries in case of failures. These identical blocks are guaranteed to be deduplicated by ClickHouse. We have also developed a run-time verification tool that monitors Kafka’s internal metadata topic and raises alerts when the invariants required for exactly-once delivery are violated. Our solution has been developed and deployed to production clusters that span multiple datacenters at eBay.
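The idea of recording a block's offset range before sending it, then rebuilding the identical block on a re-try, can be illustrated with a small in-memory sketch. This is not the implementation described in the talk: the partition log is a plain Python list, the metadata store is a dict, and `FakeTable`, `form_block`, `load_block`, and `recover` are hypothetical names invented for illustration. `FakeTable` only stands in for ClickHouse's block-level deduplication by remembering a hash of each inserted block.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FakeTable:
    """Hypothetical stand-in for a table that drops blocks it has already seen."""
    rows: list = field(default_factory=list)
    seen: set = field(default_factory=set)

    def insert(self, block):
        # Identical blocks hash to the same digest, so a repeat is ignored.
        digest = hashlib.sha256("\n".join(block).encode()).hexdigest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        self.rows.extend(block)
        return True


def form_block(log, begin, end):
    """Build a block from an inclusive offset range; same range, same block."""
    return log[begin:end + 1]


def load_block(log, table, metadata, begin, end, crash_after_insert=False):
    # Record the intended offset range BEFORE sending the block.
    metadata["pending"] = (begin, end)
    table.insert(form_block(log, begin, end))
    if crash_after_insert:
        return  # simulated failure: the insert happened but was never acknowledged
    metadata["committed"] = end
    del metadata["pending"]


def recover(log, table, metadata):
    """On restart, rebuild the pending block from the recorded range and resend it."""
    if "pending" in metadata:
        begin, end = metadata["pending"]
        table.insert(form_block(log, begin, end))  # deduplicated if already stored
        metadata["committed"] = end
        del metadata["pending"]
```

In this sketch, calling `load_block(..., crash_after_insert=True)` followed by `recover(...)` leaves each row in the table exactly once, because the re-sent block is byte-for-byte the same as the first attempt. A block formed from a different offset range after the failure would not be recognized as a duplicate, which is the case the recorded metadata is meant to rule out.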

Jun Li
Principal Architect, eBay