KAFKA SUMMIT EUROPE

May 11 - 12, 2021

Stream Data Deduplication Powered by Kafka Streams

Date: May 12, 2021

Time: 01:30 PM - 02:00 PM

Representations of the same real-world entity, such as a news item, a person, or a place, differ between sources. Duplicates therefore need to be identified, for example when streaming deduplicated news from different sources into a sentiment classifier. We built a system that collects data from different sources in a streaming fashion, aligns it to a global schema, and then detects duplicates within the data stream without time window constraints. The challenge is not only to process newly published data without significant delay, but also to reprocess hundreds of millions of existing messages, for example after improving the similarity measure. In this talk, we present our implementation of data stream deduplication built on top of Kafka Streams. For this, we leverage Kafka Streams APIs, namely state stores, and use Kubernetes to auto-scale our application from 0 to a defined maximum. This allows us to process live data immediately and to reprocess all data from scratch within a reasonable amount of time.
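To make the idea of deduplication without time window constraints concrete, here is a minimal sketch in plain Python. It is not the speakers' implementation and does not use the Kafka Streams API: an in-memory dict stands in for a persistent state store, and the `Record` type, the token-based Jaccard similarity, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record already aligned to a global schema."""
    source: str
    title: str


def tokens(text: str) -> frozenset[str]:
    """Lowercased word set used by the toy similarity measure."""
    return frozenset(text.lower().split())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two token sets (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class Deduplicator:
    """Remembers every unique record seen so far; nothing expires."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold
        # Stand-in for a state store: record key -> token set.
        self.store: dict[str, frozenset[str]] = {}

    def process(self, record: Record) -> Record | None:
        """Return the record if it is new, or None if it is a duplicate."""
        candidate = tokens(record.title)
        # Linear scan for brevity; a large store would need an index.
        for seen in self.store.values():
            if jaccard(candidate, seen) >= self.threshold:
                return None
        self.store[f"{record.source}:{record.title}"] = candidate
        return record


dedup = Deduplicator(threshold=0.6)
stream = [
    Record("feed-a", "Kafka Summit Europe opens today"),
    Record("feed-b", "kafka summit europe opens"),
    Record("feed-a", "New similarity measure released"),
]
unique = [r for r in stream if dedup.process(r) is not None]
```

Under these assumptions, the second record is dropped as a duplicate of the first, no matter how much later it arrives. Reprocessing after a change to the similarity measure corresponds to starting with an empty store and replaying the whole stream through a new `Deduplicator`.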

Speakers

Philipp Schirmer

Software Engineer, bakdata GmbH
