Streaming Updates through Complex Operations in Kafka Streams at Scale
Date : April 25, 2022
Time : 1:00 PM - 1:45 PM

With Kafka Streams, you can build complex stream processing topologies, including Joins and Aggregations over data streams. Making these complex processing pipelines aware of updates (including deletions) often becomes difficult, especially for previously joined and aggregated data. Because of the sheer amount of data, re-aggregating or re-joining from scratch at some time to handle updates correctly is not desirable. In practice, these updates play an important role in stream processing, for example, to continuously improve data quality, to ensure data privacy, or to handle late-arriving data. This talk explores how we efficiently handle these stream updates and deletions in consecutive joins with Kafka Streams. Furthermore, we present an optimization for the aggregate operation in Kafka Streams, leveraging state stores to handle updates in complex aggregates. We discuss the challenges we encountered running complex stream processing topologies on Kubernetes and explore the solution with hands-on experiments. Finally, we demonstrate how splitting the stream processing topologies enables us to have more fine granular control over resource allocation and scalability of different consecutive processing steps. We show how this improves cost-efficiency through autoscaling and the overall manageability of such streaming pipelines.

Victor Künstler
Software Engineer, bakdata GmbH