Kafka architecture—a deep dive

Kafka Streams

Kafka Streams is a client library for building applications and microservices where the input and output data are stored in Apache Kafka® clusters. Unlike many stream-processing systems, Kafka Streams is not a separate processing cluster; it integrates directly into Java applications and standard microservices architectures. This makes it a highly accessible tool for developers working within the Kafka ecosystem.

Kafka Streams simplifies the development of highly scalable, fault-tolerant streaming applications. It bridges the gap between data ingestion and data processing in applications that process data in real time and at scale.

Summary of key features in Kafka Streams

Concept | Description
Time windowing and aggregation | Supports windowing operations to group data records by time intervals, essential for time-based aggregations and processing.
Stateful operations | Maintains local state stores for stateful processing without external dependencies.
Stream-table duality | Unifies stream and table concepts in one flexible data handling and processing framework.
Fault tolerance and scalability | Automatically handles failures and supports scaling out of the box, ensuring data is processed reliably and in a distributed manner.
Exactly-once processing | Guarantees that each record will be processed exactly once, crucial for accurate computations and analytics.
Deployable anywhere | Kafka Streams applications can run anywhere—on a laptop, in a cloud, or a data center—without requiring a separate cluster.

We explore these concepts in more detail throughout the rest of the article.

Core Kafka Streams concepts

Kafka Streams operates on a few fundamental concepts that form the basis of its architecture and functionality. Understanding these concepts is crucial for effectively leveraging the stream processing capabilities of Kafka Streams.

Streams

A stream in Kafka Streams is a sequence of data records, where each record represents a self-contained datum in an unbounded dataset. Streams are particularly useful for representing events or changes over time, enabling applications to process data continuously as it arrives.

Consider an IoT scenario where multiple sensors are deployed in a smart home environment. Each sensor sends readings about various parameters such as temperature, humidity, and light levels every few seconds, creating a continuous flow of data. These data points are sent to a Kafka topic as records. Each record in the stream might contain the sensor ID, the type of reading (e.g., temperature), the actual reading value, and a timestamp.
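
Jumping slightly ahead to the DSL covered later in this article, here is a rough sketch of consuming such readings as a stream. The topic name and the SensorReading class are illustrative, and a StreamsBuilder named builder is assumed:

// Each record is keyed by the sensor ID and valued by a SensorReading,
// a hypothetical POJO holding the reading type, value, and timestamp
KStream<String, SensorReading> readings = builder.stream("sensor-readings");
// Keep only the temperature readings for further processing
KStream<String, SensorReading> temperatures =
    readings.filter((sensorId, reading) -> "temperature".equals(reading.getType()));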

Tables

While a stream represents a sequence of records over time, a table in Kafka Streams acts like a snapshot that captures each key's latest value at any moment. Think of it as a continuously updated record of the most current state of each key. It is analogous to a database table but is backed by Kafka topics that store the table's change log. This duality allows Kafka Streams to handle both real-time data streams and the current state of data objects.

Consider an e-commerce platform where orders are continuously being placed, updated, and fulfilled. Each order update is a record in a stream, but you might want to track the latest status of each order in real time. A Kafka Streams table can keep track of the latest status as order updates stream in—such as orders placed, shipped, and delivered. The table updates the status of each order accordingly.
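
A minimal sketch of that idea (the topic name is illustrative, and a StreamsBuilder named builder is assumed): reading the order-update topic as a table keeps only the latest status per order ID.

// Each update is keyed by order ID; the table retains only the latest
// status value for each key
KTable<String, String> orderStatus = builder.table("order-updates");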

The two data abstractions, streams and tables, enable complex processing operations such as joins, aggregations, and windowing.

Kafka Streams DSL

Kafka Streams DSL (Domain-Specific Language) is a high-level, declarative language built on top of the core Kafka Streams library. It provides a simpler, more readable way to define common stream processing operations like filtering, mapping, aggregation, and windowing.

Here’s a basic example of using the Streams DSL to count the number of events per type:

// Start building the stream processing topology
StreamsBuilder builder = new StreamsBuilder();
// Read events from the "event-stream" topic, with each event keyed
// by a String and valued by an Event object
KStream<String, Event> events = builder.stream("event-stream");
// Group events by their type using event.getType(), then count them.
// This produces a KTable that continuously updates as new events arrive.
KTable<String, Long> eventCounts = events
    .groupBy((key, event) -> event.getType())
    .count();
// Convert the KTable back to a KStream and write the count for each
// event type to the "event-counts-output" topic
eventCounts.toStream().to("event-counts-output");

Processor API

The DSL prioritizes readability and rapid development for straightforward tasks. However, it is less suitable for complex scenarios that need fine-grained control over state or custom processing paths.

Kafka Streams offers the Processor API for greater control and flexibility. This lower-level API allows developers to define and connect custom processors as needed. It supports custom state management and complex processing logic.

Here’s a brief snippet illustrating the creation of a custom processor:

class MyProcessor extends AbstractProcessor<String, String> {
    @Override
    public void process(String key, String value) {
        // Custom processing logic here, for example enriching or
        // filtering the value before passing it on
        context().forward(key, value);
    }
}

It is important to note that the Processor API can lead to less maintainable code. It is best reserved for bespoke processing needs beyond the DSL's capabilities, such as complex event-driven patterns or integration with external systems for custom state management.
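
As a rough sketch of that wiring (topic and processor names are illustrative), the MyProcessor class above can be connected to source and sink topics through the Topology API:

// Build a topology by explicitly naming and connecting processors
Topology topology = new Topology();
topology.addSource("Source", "input-topic");
// MyProcessor::new supplies a fresh processor instance per stream task
topology.addProcessor("Process", MyProcessor::new, "Source");
topology.addSink("Sink", "output-topic", "Process");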

Setting up Kafka Streams

Setting up a Kafka Streams application involves several key steps.

Step 1. Create a Kafka Streams project

Begin by creating a new Java project in your preferred IDE. You can use Maven or Gradle as your build tool. Here’s a simple Maven setup:

<dependencies>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>2.8.0</version>
    </dependency>
</dependencies>

Ensure you include the Kafka Streams library in your pom.xml or build.gradle file. You might also need logging dependencies like SLF4J and Logback for better visibility during development.

Step 2. Write your first Streams application

Utilize the StreamsBuilder to define your processing topology. For instance, you could create a simple application that reads from one topic, performs a transformation, and writes to another topic:

// Initialize the processing topology for the application
StreamsBuilder builder = new StreamsBuilder();
// Create a stream from the 'input-topic'
KStream<String, String> source = builder.stream("input-topic");
// Transform each value to uppercase
KStream<String, String> transformed = source.mapValues(value -> value.toUpperCase());
// Send the transformed stream to 'output-topic'
transformed.to("output-topic");

Configure your application to connect to the Kafka cluster. Create a Properties object to specify settings such as the application ID, bootstrap servers, and default key and value serdes.

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-application");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

Run your application. It will connect to Kafka and start processing data streams according to your defined topology.
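
A minimal sketch of tying the topology and configuration together (assuming the builder and props objects defined above):

// Create the Streams client from the topology and configuration
KafkaStreams streams = new KafkaStreams(builder.build(), props);
// Start processing records according to the defined topology
streams.start();
// Close the application cleanly when the JVM shuts down
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));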

Monitoring Kafka Streams

Monitoring is critical to ensure your stream processing applications perform well and are fault-tolerant.

Basic monitoring setup

Utilize JMX metrics exposed by Kafka Streams to monitor aspects like throughput, latency, and error rates. Set up monitoring dashboards using tools like Grafana or Prometheus to visualize these metrics.
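
The same metrics are also available programmatically. As a rough sketch (assuming a running KafkaStreams instance named streams; the metric group name can vary by Kafka version):

// Print thread-level metrics such as process rate and commit latency
streams.metrics().forEach((metricName, metric) -> {
    if ("stream-thread-metrics".equals(metricName.group())) {
        System.out.println(metricName.name() + " = " + metric.metricValue());
    }
});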

Performance tips

Be mindful of the size and performance of Kafka Streams state stores, especially when dealing with large volumes of data. Ensure adequate resources (CPU, memory, network) are allocated based on the data volume and complexity of processing operations.

Optimal configuration

Adjust configurations like commit.interval.ms and cache.max.bytes.buffering to balance latency and throughput. Kafka Streams performs a repartition when you change the keys of the events/messages you are processing. Use repartition topics sparingly, as they can increase the load on the Kafka cluster.
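
As a rough starting point (the values below are illustrative, not recommendations), these settings are plain Streams configuration properties:

// Commit offsets and state more frequently for lower end-to-end latency
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);
// Give the record cache 10 MB to buffer updates before forwarding them downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);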

As you become more familiar with Kafka Streams, you can fine-tune these configurations to better suit your specific use cases.

Advanced stream processing

As you become more proficient with Kafka Streams, you'll encounter more complex scenarios requiring advanced stream processing techniques.

Handling late data

Late data refers to records that arrive at the source topic later than expected, often due to delays in the data pipeline. The 'data pipeline' here refers to the sequence of systems and processes through which data passes before it reaches Kafka. The 'expected time window' is a predefined interval during which data records are anticipated to arrive based on their timestamp. This window is calculated based on the event time (when the event occurred) as recorded in the data itself.

For example, if an event is timestamped at 12:00 PM, and the window is set from 12:00 PM to 12:05 PM, any record timestamped at 12:00 PM is expected to arrive within this five-minute window. Delays in data transport or processing can cause records to arrive outside this window, classifying them as late data. Kafka Streams offers several strategies to manage this:

  • Tumbling Windows: Fixed-size, non-overlapping intervals that reset at the start of each window.
  • Hopping Windows: Fixed-size, overlapping windows that "hop" forward by a specified interval.
  • Sliding Windows: Windows defined by the maximum time difference between two records, re-evaluated as each event arrives, including late ones (see the declaration sketch after this list).
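
As a rough declaration sketch (the durations are illustrative), the three window types look like this in the DSL:

// Tumbling: fixed five-minute windows that do not overlap
TimeWindows tumbling = TimeWindows.of(Duration.ofMinutes(5));
// Hopping: five-minute windows that advance every minute, so they overlap
TimeWindows hopping = TimeWindows.of(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1));
// Sliding: windows defined by the maximum time difference between two records
SlidingWindows sliding = SlidingWindows.withTimeDifferenceAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1));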

You can also set a grace period on windowed computations to allow late-arriving data to be included before the window is finally closed. This ensures that records are still processed if they arrive late, as long as they fall within the defined grace period.

TimeWindows.of(Duration.ofMinutes(5))
    .grace(Duration.ofMinutes(1));

Advanced windowing

Advanced windowing techniques in Kafka Streams let you handle diverse and dynamic data characteristics.

Session windows

Session windows are dynamically sized and created based on periods of activity. They close after a configurable period of inactivity, making them ideal for use cases like user session analysis.

For instance, you can determine user engagement duration by tracking the start and end of a user's interactions within an app.

KStream<String, UserInteraction> interactions = builder.stream("user-interactions");
KTable<Windowed<String>, Long> sessionDurations = interactions
    .groupByKey()
    .windowedBy(SessionWindows.with(Duration.ofMinutes(30))) // 30 minutes of inactivity ends the session
    .count();

Custom windowing

For scenarios that do not fit standard window types, Kafka Streams allows the development of custom windowing logic to handle unique processing requirements.

This is useful in scenarios where event patterns do not align with fixed or known intervals. For example, windows might be defined by event characteristics, such as a sudden spike in data or a specific event marking the start or end of a window.

KStream<String, Event> events = builder.stream("event-stream");
// Supply a new transformer instance for each stream task; sharing a single
// instance across tasks is not safe
KStream<String, Long> processedEvents = events.transform(CustomWindowTransformer::new, "state-store-name");

State management

Understanding state management is crucial for leveraging the full capabilities of Kafka Streams, particularly when dealing with applications that require persistent states across processing steps.

State in Kafka Streams is defined through key-value stores tailored to the needs of your application. Developers can use the high-level DSL for common patterns or the Processor API for customized state handling. You can choose between persistent or in-memory storage options based on application requirements.

Local state stores within Kafka Streams are primarily backed by internal Kafka topics known as changelog topics. These topics utilize Kafka’s log compaction feature to retain only the latest value for each key. They facilitate efficient state recovery and maintenance.

The following code snippet demonstrates how to define a state store within a Kafka Streams application and materialize it so that it is backed by a changelog topic for robust data handling and recovery.

StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> table = builder.table(
    "input-topic", 
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("my-state-store")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.String())
);

Each application instance manages its own piece of the state. It updates and maintains the state, recording any changes in the changelog topic for consistency across application restarts or failures.

Kafka Streams supports state sharding and external state management strategies for applications with extensive state requirements. It integrates with external databases or other storage systems to maintain performance and manageability.

Interactive queries

Interactive queries extend Kafka Streams' capabilities by allowing an application to act as both a stream processor and a data store that you can query in real time.

Kafka Streams maintains a local state that you can query directly. It provides mechanisms to query remote states transparently for states distributed across multiple instances.

For example, an e-commerce platform can use interactive queries to fetch real-time views of customer shopping carts or inventory levels, enabling instantaneous updates and interactions with customers.

ReadOnlyKeyValueStore<String, Long> keyValueStore =
    streams.store(StoreQueryParameters.fromNameAndType("cart-store", QueryableStoreTypes.keyValueStore()));
Long itemCount = keyValueStore.get("customer-id-1234");

Best practices for performance optimization and scalability

Here are some Kafka Streams best practices:

State store management

Customize the configuration of RocksDB, the default state store used by Kafka Streams. Adjusting parameters like cache size and write buffer size can significantly improve performance. Consider using in-memory state stores for smaller datasets for faster access and lower latency.
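
A rough sketch of such a customization (the buffer values are illustrative, not recommendations) implements RocksDBConfigSetter and registers it in the Streams configuration:

public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        // Enlarge the write buffer and allow more buffers to reduce flush frequency
        options.setWriteBufferSize(16 * 1024 * 1024L);
        options.setMaxWriteBufferNumber(3);
    }

    @Override
    public void close(String storeName, Options options) {
        // Nothing to release in this sketch
    }
}

// Register the setter so every RocksDB state store picks it up
props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);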

Stream thread management

Increase the number of stream threads to parallelize processing. This can be done by setting the num.stream.threads property, allowing the application to process multiple partitions concurrently. Isolate critical streams or processing paths into separate applications or topologies so that heavy processing loads do not impact the entire application.
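
For example (the thread count is illustrative and should match your partition count and hardware):

// Process up to four partitions in parallel within this instance
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);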

Serde optimization

Implement custom serialization and deserialization (Serde) for complex data types to reduce serialization overhead. Where possible, reuse Serde instances across the application to save on the overhead of Serde initialization.
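
A minimal sketch of reusing one custom Serde instance (the topic names and the EventSerializer and EventDeserializer classes for an Event type are hypothetical):

// Build the Serde once from hypothetical serializer/deserializer implementations
Serde<Event> eventSerde = Serdes.serdeFrom(new EventSerializer(), new EventDeserializer());
// Reuse the same instance wherever Event values are read or written
KStream<String, Event> events = builder.stream("event-stream", Consumed.with(Serdes.String(), eventSerde));
events.to("enriched-events", Produced.with(Serdes.String(), eventSerde));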

Effective partitioning

Design your topics with scalability in mind. Choose a partition count that supports your throughput requirements without over-partitioning, which can lead to unnecessary resource usage. Use repartitioning carefully. While it allows for greater parallelism, it also introduces additional overhead and complexity in your data flow.

Resource allocation

Allocate sufficient CPU and memory resources based on your application’s performance profile. Monitoring tools can help identify bottlenecks and resource shortages. Utilize Kubernetes or other orchestration tools to dynamically scale your Kafka Streams applications based on load. This helps maintain performance during peak loads without over-provisioning resources.

Load balancing

Distribute load evenly across the Kafka Streams instances to prevent any single instance from becoming a bottleneck. Design your applications to handle failures gracefully. Use Kafka’s built-in fault tolerance mechanisms, such as replication and log compaction, to ensure data is not lost during a failure.

Practical challenges and solutions

Deploying Kafka Streams applications involves various challenges affecting performance, reliability, and maintainability.

Handling state growth

As state stores grow, they can consume substantial disk space and memory, degrading performance and increasing operational costs.

Implement strategies such as log compaction in Kafka to minimize disk usage and maintain only the latest version of each key. Periodically purge old or unnecessary data from state stores to manage size and improve access speed.

Rebalancing overheads

Frequent rebalancing can disrupt stream processing, especially in dynamic environments where clusters are scaled up or down regularly. Minimize rebalancing by keeping the number of partitions and stream threads stable and by using sticky partition assignment to instances where possible.

Handling time skew in events

Events arriving out of order or with skewed timestamps can lead to incorrect processing results, especially when windowing operations are involved. Use an appropriate timestamp extractor and grace periods to manage out-of-order data and ensure accurate window computations.
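
A rough sketch of a custom timestamp extractor (the Event type and its getTimestamp() accessor are hypothetical):

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Prefer the event time carried in the payload; fall back to the
        // partition's previous timestamp if it is missing
        if (record.value() instanceof Event) {
            return ((Event) record.value()).getTimestamp();
        }
        return partitionTime;
    }
}

// Register it as the default extractor for the application
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimeExtractor.class);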

Monitoring for performance optimization

Regular monitoring of Kafka Streams applications is essential to maintain optimal performance.

Leverage Kafka’s built-in metrics to monitor lag, throughput, and processing latency. Implement detailed logging for critical paths in your application to help with debugging and performance tuning.

You can also set up dashboards using tools like Grafana to visualize performance metrics in real time. Configure alerts based on thresholds for critical metrics to ensure immediate response to potential issues.

A proactive approach to performance and resource management enhances your applications' reliability and optimizes operational costs.

Conclusion

Kafka Streams has its challenges, which can require workarounds and add operational overhead. Apache Flink® is an alternative stream-processing framework that provides extensive capabilities for stateful computations across unbounded and bounded data streams. In the Flink vs. Kafka Streams debate, Flink excels in scenarios that require complex state management and event-time processing. It provides robust support for exactly-once semantics and complex event processing, making it suitable for applications that need strong guarantees and advanced capabilities. Flink also supports a broader range of programming languages, including Java, Scala, and Python, offering greater flexibility than Kafka Streams, which is tied to the JVM (Java Virtual Machine).

Redpanda is a modern drop-in Kafka replacement that integrates with Flink to provide stateful stream processing alongside operational simplicity. Redpanda also uses WebAssembly to handle stream transformations, providing a sandboxed environment for stream processing logic to be executed safely and efficiently. This feature enhances security and performance, particularly when rapid, on-the-fly code changes are necessary.

To learn more, check out Redpanda’s capabilities or dive right into a free trial to take Redpanda for a spin.
