Overview

Kafka Streams is a powerful client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It enables real-time processing of data streams with simplicity, scalability, and fault tolerance.

What is Kafka Streams?

Kafka Streams is designed to process and analyze data stored in Kafka. It allows developers to build sophisticated stream processing applications that can transform, aggregate, and enrich data in real-time. By leveraging Kafka’s robust messaging capabilities, Kafka Streams ensures high throughput and low latency for stream processing.

How Kafka Streams Works

Kafka Streams applications are standard Java applications that leverage Kafka’s producer and consumer APIs. Here’s a high-level overview of how it works:

  1. Stream Topology: A Kafka Streams application is defined by a topology of stream processors, which are connected by streams of data. The topology describes how data flows and is processed from input topics to output topics.
  2. Stream Processing: Each processor in the topology can perform transformations, aggregations, joins, and other operations on the streaming data. The processed data is then forwarded to downstream processors or output topics.
  3. Parallel Processing: Kafka Streams automatically parallelizes the processing by dividing the data into partitions and assigning them to different instances of the application.
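The topology described above can be sketched with the Kafka Streams DSL. This is a minimal illustration, not a complete application; the topic names ("input-topic", "output-topic") are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class UppercaseTopology {
    // Build a topology: source ("input-topic") -> stream processor -> sink ("output-topic").
    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(v -> v.toUpperCase())   // one processing step in the topology
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }
}
```

Because the input topic is partitioned, Kafka Streams can run several instances of this same topology in parallel, each handling a subset of the partitions.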

State Management

Kafka Streams supports stateful stream processing, which is crucial for operations like aggregations, joins, and windowing. It manages state using embedded key-value stores:

  • Local State Stores: Each instance of a Kafka Streams application maintains its own local state stores. Fault tolerance comes from backing each store with a compacted changelog topic in Kafka, so the state of a failed instance can be restored on another instance.
  • RocksDB: By default, Kafka Streams uses RocksDB as the underlying storage for state stores. This provides efficient persistence and retrieval of state.
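As an illustration of a stateful operation, the sketch below counts records per key and materializes the result into a named state store (RocksDB-backed by default). The topic name "words" and the store name "counts-store" are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountStore {
    // Group records by key and count them; the counts are materialized into
    // a local state store named "counts-store", backed by RocksDB by default.
    public static KTable<String, Long> buildCounts(StreamsBuilder builder) {
        return builder
                .stream("words", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
    }
}
```

Kafka Streams also writes every update to this store into a changelog topic, which is what makes the local state recoverable after a failure.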

Event Handling

Kafka Streams processes events from Kafka topics in a continuous and scalable manner:

  1. Record Streams and Tables: Kafka Streams treats input topics as either streams (a sequence of immutable events) or tables (a changelog of updates).
  2. Event Processing: Events are processed as they arrive. Kafka Streams provides powerful operators like map, filter, join, and aggregate to transform and process these events.
  3. Windowing: For time-based operations, Kafka Streams supports windowing, allowing aggregation of events within specific time intervals (e.g., tumbling windows, sliding windows).
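A windowed aggregation over the events described above might look like the following sketch (the topic name "events" is illustrative, and `TimeWindows.ofSizeWithNoGrace` assumes a Kafka Streams 3.x API):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedCount {
    // Count events per key within non-overlapping (tumbling) five-minute windows.
    public static void define(StreamsBuilder builder) {
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
               .count();
    }
}
```

Hopping windows would use the same builder with an `advanceBy` step, and session windows would use `SessionWindows` instead of `TimeWindows`.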

Differences from Kafka Messaging

While both Kafka Messaging (using Kafka producers and consumers) and Kafka Streams operate on Kafka topics, there are key differences:

  • Purpose: Kafka Messaging is used for general-purpose messaging between producers and consumers, while Kafka Streams is specifically designed for stream processing and real-time analytics.
  • State Management: Kafka Messaging does not manage state, whereas Kafka Streams provides built-in support for stateful processing using local state stores.
  • Processing Semantics: Kafka Streams can provide exactly-once processing semantics when configured to do so (via the processing.guarantee setting), ensuring that the results of each event are reflected once and only once. Plain Kafka producers and consumers typically provide at-least-once semantics unless transactions are used explicitly.
  • Stream Processing API: Kafka Streams provides a high-level API for defining stream processing topologies, which simplifies the development of complex stream processing logic. Kafka Messaging uses the lower-level producer and consumer APIs.
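The exactly-once guarantee mentioned above is opt-in. A minimal configuration sketch (application id and broker address are placeholders) might look like this; `EXACTLY_ONCE_V2` assumes Kafka Streams 3.0 or later:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Exactly-once is opt-in; the default guarantee is at-least-once.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```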
Comparison with Apache Flink and Amazon Kinesis

| Feature | Apache Flink | Amazon Kinesis | Kafka Streams |
| --- | --- | --- | --- |
| Primary Use Case | General-purpose stream and batch processing | Real-time data ingestion and processing | Stream processing with Kafka integration |
| Data Processing | Supports both stream and batch processing | Real-time processing, data ingestion, and analytics | Stream processing on Kafka data |
| State Management | Built-in state management with keyed and operator state | No native state management (requires external storage) | Embedded state stores with RocksDB |
| Fault Tolerance | Exactly-once semantics with distributed snapshots | Data replication across AZs, at-least-once semantics | Optional exactly-once semantics, changelog-based state restoration |
| Scalability | Horizontal scaling with task managers | Shard-based architecture, automatic scaling | Scales with Kafka partitions, parallel processing |
| Latency | Low latency with fine-grained control | Low latency for ingestion and processing | Low latency, tightly integrated with Kafka |
| Integration | Integrates with various data sources and sinks | Integrates with AWS services like S3, Redshift, Lambda | Tight integration with the Kafka ecosystem |
| Deployment | Requires setup and management of Flink clusters | Fully managed service by AWS | Embedded in Java applications, no separate cluster needed |
| Ease of Use | Requires setup and operational management | Easy to use with AWS management | Easy to use for Kafka users, embedded in applications |
| Complex Event Processing | Advanced support with CEP library | Basic support through Kinesis Data Analytics | Supported via the Streams DSL and Processor API |
| Windowing and Time Semantics | Rich windowing capabilities with event-time processing | Basic windowing support in Kinesis Data Analytics | Supports windowing, joins, and aggregations |
| Streaming SQL | Flink SQL for stream processing | Kinesis Data Analytics with SQL support | No built-in SQL; ksqlDB available separately |
| Community and Ecosystem | Large open-source community and broad ecosystem | Part of the AWS ecosystem | Part of the Kafka ecosystem |

Interview Questions

Easy

  1. What is Apache Kafka Streams?
    • Apache Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It allows for processing and transforming data stored in Kafka topics.
  2. What are the main components of Kafka Streams?
    • The main components include the Kafka Streams API, Stream Processor, KStream, KTable, and GlobalKTable.
  3. What is a KStream in Kafka Streams?
    • A KStream is an abstraction for a stream of records where each record is a key-value pair. It represents a continuous flow of data.
  4. What is a KTable in Kafka Streams?
    • A KTable is an abstraction for a changelog stream from a primary-keyed table. It represents a snapshot of the latest value for each key in a stream.
  5. How do you start a Kafka Streams application?
    • You start a Kafka Streams application by defining a topology of processing nodes, creating a StreamsBuilder object, building the topology, and starting the KafkaStreams instance.
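The startup sequence in the last answer can be sketched as follows; the application id, broker address, and topic names are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsAppMain {
    public static void main(String[] args) {
        // 1. Configure the application.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // 2. Define the topology with a StreamsBuilder.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("in").to("out");

        // 3. Build the topology and start the KafkaStreams instance;
        //    close it cleanly on JVM shutdown.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```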

Medium

  1. What is a topology in Kafka Streams?
    • A topology in Kafka Streams is a directed acyclic graph (DAG) of stream processing nodes. Each node represents a processing step and the edges represent the data flow between these steps.
  2. Explain the difference between KStream and KTable.
    • KStream represents a stream of records, and every data change is recorded. KTable represents a table of the latest values for each key and records changes as updates.
  3. How does Kafka Streams ensure fault tolerance?
    • Kafka Streams ensures fault tolerance by backing state stores with changelog topics in Kafka, from which state can be restored after a failure, and by using Kafka’s consumer group protocol to reassign partitions when an instance fails.
  4. What is stateful processing in Kafka Streams?
    • Stateful processing involves operations that require maintaining state across records, such as aggregations, joins, and windowed computations. Kafka Streams supports stateful processing with state stores.
  5. What are stream-table joins in Kafka Streams?
    • Stream-table joins allow joining a KStream with a KTable to combine streaming data with static data. This enables enriched, contextual streaming data processing.
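A stream-table join as described in the last answer might be sketched like this (the "orders", "customers", and "enriched-orders" topics are hypothetical, with orders keyed by customer id):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderEnrichment {
    public static void define(StreamsBuilder builder) {
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> customers =
                builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

        // For each order, look up the latest customer record with the same key
        // and combine the two values into an enriched result.
        orders.join(customers, (order, customer) -> order + " [customer: " + customer + "]")
              .to("enriched-orders");
    }
}
```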

Hard

  1. Describe how windowed joins work in Kafka Streams.
    • Windowed joins join two KStreams based on a window of time: only records whose timestamps fall within the specified time difference of each other are joined. (KStream-KTable joins, by contrast, are not windowed.) For windowed aggregations, Kafka Streams offers window types such as tumbling, hopping, and session windows.
  2. How does Kafka Streams handle exactly-once semantics?
    • Kafka Streams handles exactly-once semantics using idempotent producers and transactional writes to Kafka. It ensures that each record is processed exactly once even in the case of retries or failures.
  3. What is a GlobalKTable and when would you use it?
    • A GlobalKTable is a KTable that is fully replicated on each Kafka Streams instance. It is used when you need to perform joins where each instance needs to have access to the full table data.
  4. Explain the concept of rebalancing in Kafka Streams.
    • Rebalancing in Kafka Streams occurs when there is a change in the number of instances or the number of input partitions. During rebalancing, partitions are reassigned to ensure load balancing across instances. This can impact processing as state stores might need to be transferred.
  5. How do you implement custom processors in Kafka Streams?
    • Custom processors in Kafka Streams are implemented via the Processor API: implement the Processor interface (org.apache.kafka.streams.processor.api.Processor), override the process method, and keep the ProcessorContext passed to init for forwarding records downstream. You then add the processor to your topology using the Topology.addProcessor method. (Older code extended the now-deprecated AbstractProcessor class.)
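A windowed stream-stream join, as discussed in the first question above, might be sketched as follows (topic names are hypothetical; `JoinWindows.ofTimeDifferenceWithNoGrace` assumes a Kafka Streams 3.x API):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

public class ClickImpressionJoin {
    public static void define(StreamsBuilder builder) {
        KStream<String, String> clicks =
                builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> impressions =
                builder.stream("impressions", Consumed.with(Serdes.String(), Serdes.String()));

        // Join records that share a key and whose timestamps lie within
        // five minutes of each other.
        clicks.join(impressions,
                    (click, impression) -> click + " <- " + impression,
                    JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                    StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
              .to("joined");
    }
}
```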
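A custom processor using the Processor API, as described in the last question, might be sketched like this (processor and topic names are illustrative; the interface shown is the org.apache.kafka.streams.processor.api variant from Kafka 3.x):

```java
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class UppercaseProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;  // keep the context so process() can forward records
    }

    @Override
    public void process(Record<String, String> record) {
        // Transform the value and pass the record on to downstream nodes.
        context.forward(record.withValue(record.value().toUpperCase()));
    }

    // Wire the processor into a topology between a source and a sink.
    public static Topology buildTopology() {
        Topology topology = new Topology();
        topology.addSource("Source", "input-topic");
        topology.addProcessor("Uppercase", UppercaseProcessor::new, "Source");
        topology.addSink("Sink", "output-topic", "Uppercase");
        return topology;
    }
}
```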
