
Apache Kafka stream processing with Kafka Streams

The rise of the event-driven paradigm has transformed how software is designed. Applications and databases now often produce streams of events to indicate what is happening within them, with Apache Kafka® being a popular broker for collecting and redistributing these events. But what is the most effective way to process a stream of events? How can you build pipelines that are both easy to implement and powerful enough to cover a wide range of use cases while leveraging the Kafka ecosystem?

The Kafka Streams framework, developed by the creators of Kafka, is a library designed specifically for event stream processing. It offers a declarative approach that simplifies the development of data-intensive applications.

This guide first introduces some relevant use cases for Kafka Streams and then explores the key concepts that enable their implementation within the Kafka Streams framework. It concludes with practical implementations using the Kafka Streams Java library.

Why use Kafka Streams?

Kafka itself is an open-source distributed event streaming platform, and Kafka Streams is a framework for building stream-processing applications that work on top of Kafka. It offers a declarative approach to creating pipelines that process messages and apply transformations such as filtering, aggregations, and joins. 

This approach makes it easier for developers to build complex data processing applications without implementing common data manipulation operations. The abstraction layer offered by Kafka Streams aims to accelerate the development of event-driven applications and enhance code readability and maintainability.

Kafka Streams application architecture overview

Use cases for Kafka Streams

Kafka Streams offers powerful event stream processing capabilities that make it ideal for a wide range of use cases, including fraud detection, data cleansing, workflow automation, event-driven communication, enriching data streams, and real-time analytics.

Let’s explore some common Kafka Streams use cases.

Preprocessing

One of the primary use cases for Kafka Streams is data preprocessing. Typically, using Kafka Connect, the streaming process starts by ingesting data from various sources, such as databases (MongoDB, Postgres), third-party applications (Salesforce), or other messaging platforms (Amazon SNS, Google Pub/Sub), followed by transforming the data and publishing it to the sink.

Preprocessing messages means validating or transforming the input so that downstream applications can use the resulting data. While transforming the data, you can also fix or remove incorrect, duplicate, or irrelevant records to avoid polluting downstream systems. This is especially important when moving data from an unstructured source like MongoDB to a system that imposes a strict schema.

Data preprocessing and cleansing with Kafka Streams

Event-driven communication and workflow automation

Kafka Streams enables event-driven architecture by transforming and routing data streams. It splits and reshapes data from one stream to multiple topics, ensuring proper formatting for each microservice. This approach prevents services from consuming irrelevant data and promotes better microservice decoupling.

For example, in an e-commerce system, an order stream can be split into separate streams for user notifications, delivery, and procurement. Each downstream application receives only the relevant, properly formatted data it needs, which improves overall system efficiency and maintainability.

Event-driven communication workflow automation with Kafka Streams

You can model the entire workflow of delivering goods to customers as a set of data streams and implement one or more Kafka Streams applications that act as the controllers orchestrating the data flow within your system.

Real-time analytics

Kafka Streams offers various processors to handle analytics, with common aggregation functions such as count, min, max, and average, as well as time-based aggregation.

For example, you could calculate the average number of orders per hour and the total revenue for the day or identify trends in keyword searches. These insights can be forwarded downstream to enable real-time decision-making based on up-to-date metrics.

Real-time analytics with Kafka Streams

Enriching data streams

Kafka Streams can also aggregate and merge streams to enable real-time data enrichment that provides more comprehensive and meaningful insights. Without a system like Kafka Streams, you would have to capture changes and propagate them to other services yourself to get a complete picture. For example, in an e-commerce website, you could combine streams of user activity and inventory levels to create a low-stock alert for your procurement system.

Enriching data streams with Kafka Streams

What’s stream processing with Apache Kafka? 

Kafka Streams uses terminology borrowed from graph theory to model the flow of data within an application.

A Kafka Streams application defines and executes one or more “topologies,” which represent the architecture of your stream processing application. Each topology defines a graph that represents the flow of data within the application. The edges of this graph are called streams, and the nodes are called processors.

Within a topology, data flows from one or more source streams (corresponding to Kafka topics), traverses the graph while being transformed, and is finally published to sink streams, which also correspond to Kafka topics.

A processor is a function that performs computation based on its input streams and publishes the result to one or more output streams. Some processors need to maintain state between records; for example, a processor that counts messages needs to store the current total. So, some processors will have an associated store.
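
To make these terms concrete, here is a minimal sketch (with hypothetical topic names) of a topology built with the Kafka Streams DSL. Topology#describe() prints the resulting graph of source, processor, and sink nodes:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class TopologyDescriptionExample {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Source stream: an edge fed by the (hypothetical) "orders" topic
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                // Processor node: a stateless transformation
                .mapValues(value -> value.toUpperCase())
                // Sink stream: an edge writing to the (hypothetical) "orders-processed" topic
                .to("orders-processed", Produced.with(Serdes.String(), Serdes.String()));

        // Print the graph of source, processor, and sink nodes
        Topology topology = builder.build();
        System.out.println(topology.describe());
    }
}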

Kafka Streams key concepts

As you saw in the Kafka Streams use cases, preprocessing data typically involves transforming or filtering messages. For real-time monitoring, operations like counting and calculating attributes (minimum, maximum, or average values) are often required. These operations are supported out of the box by the Kafka Streams framework through its processors.

Kafka Streams processors can be grouped into three main categories: stateless, stateful, and time-based processors.

Stateless processors

Stateless processors are simple transformation functions that modify a data stream, where each message is processed independently from other messages. These are typically involved in data preprocessing, and they include:

  • map(), which processes a message and returns a new message to the output stream.
  • flatMap() and flatMapValues(), which process a record and return zero or more records to the output stream.

There are also some stateless processors dedicated to flow control, meaning they can be used to fan out or merge streams; both kinds of stateless processors are combined in the sketch after the list below. Some examples include:

  • filter(), which selects messages that match a given predicate, ensuring only relevant messages are passed downstream.
  • branch() (superseded by split() in newer versions), which fans a stream out into an array of output streams by routing each message to the first output stream whose predicate it matches, letting you handle different subsets of a stream in parallel.
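
Here is a brief sketch that chains these stateless processors on a hypothetical stream of plain String sentences. It uses split(), the newer replacement for branch(), to fan the stream out by word length:

import java.util.Arrays;
import java.util.Map;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class StatelessProcessorsExample {

    public void register(StreamsBuilder builder) {
        KStream<String, String> sentences = builder.stream("sentences",
                Consumed.with(Serdes.String(), Serdes.String()));

        KStream<String, String> words = sentences
                // map(): transform each record into a new record (here, lowercase the value)
                .map((key, value) -> KeyValue.pair(key, value.toLowerCase()))
                // flatMapValues(): emit one record per word contained in the sentence
                .flatMapValues(value -> Arrays.asList(value.split("\\s+")))
                // filter(): keep only non-empty words
                .filter((key, word) -> !word.isEmpty());

        // split()/branch(): fan the stream out into two sink topics based on word length
        Map<String, KStream<String, String>> branches = words
                .split(Named.as("length-"))
                .branch((key, word) -> word.length() <= 4, Branched.as("short"))
                .defaultBranch(Branched.as("long"));

        branches.get("length-short").to("short-words", Produced.with(Serdes.String(), Serdes.String()));
        branches.get("length-long").to("long-words", Produced.with(Serdes.String(), Serdes.String()));
    }
}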

Stateful processors

A stateful processor needs to “persist” information from past events to process new ones. For instance, when counting, it needs to remember the previous total and increment it by one.

When a stateful processor is instantiated, Kafka Streams creates a local state store, either in-memory or persistent (RocksDB-backed). This store is managed entirely by the Kafka Streams framework and is fault-tolerant: all state changes are logged to a dedicated Kafka topic, enabling recovery of the latest state if the application crashes or restarts. Stateful processors are often used in real-time analytics use cases. Here are some examples, combined in the sketch after the list:

  • count() tallies the number of records per key.
  • reduce() combines all values for a given key using a custom function, enabling aggregations such as min, max, or sum (use aggregate() when the result has a different type, as for an average).
  • join(), leftJoin(), and outerJoin() combine two streams by joining their records on a common key and returning the joined value.
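
The following sketch (again with hypothetical topics) shows a per-key count and a windowed stream-stream join; the serdes and the five-minute join window are arbitrary choices for illustration:

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class StatefulProcessorsExample {

    public void register(StreamsBuilder builder) {
        KStream<String, String> pageViews = builder.stream("page-views",
                Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> clicks = builder.stream("clicks",
                Consumed.with(Serdes.String(), Serdes.String()));

        // count(): tally page views per user key; the state lives in a local store
        // backed by a changelog topic
        KTable<String, Long> viewsPerUser = pageViews
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.as("views-per-user-store"));
        viewsPerUser.toStream().to("views-per-user", Produced.with(Serdes.String(), Serdes.Long()));

        // join(): correlate a click with a page view for the same user key,
        // as long as the two events occur within five minutes of each other
        KStream<String, String> clickThroughs = pageViews.join(
                clicks,
                (view, click) -> view + "->" + click,
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));
        clickThroughs.to("click-throughs", Produced.with(Serdes.String(), Serdes.String()));
    }
}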

Timing and windowing processors

Real-time analytics often involves analyzing data over specific time intervals, such as performing hourly or daily aggregations. Windowing allows you to group records together based on their timestamp. These timestamps can either come from the messages themselves (event time) or from the application’s clock (processing time).

Kafka Streams windowed processors handle messages as they arrive but don’t necessarily emit results immediately. They buffer the incoming messages within each window and trigger the aggregation and output based on the windowing configuration. The framework offers three windowing strategies, illustrated in the sketch after the list:

  • Tumbling: Groups messages over a fixed window size, such as every hour or daily. There is no overlap in message aggregation.
  • Sliding (hopping): Groups messages over a fixed time window that advances at a given frequency (for example, a one-hour window advancing every five minutes), so consecutive windows overlap.
  • Session: Groups messages as long as they are within a specified duration of one another (session timeout). The aggregation is triggered when the gap between messages exceeds the timeout, such as aggregating messages no more than thirty minutes apart.

Windowing processor: tumbling vs. sliding vs. session
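
For reference, here is how the three window types can be declared with the Kafka Streams API (the durations are arbitrary examples). You pass one of these window definitions to windowedBy(), as shown in the analytics example later in this guide:

import java.time.Duration;

import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowDefinitionsExample {

    // Tumbling: fixed one-hour windows with no overlap
    public static final TimeWindows TUMBLING =
            TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1));

    // Sliding (hopping): one-hour windows that advance every five minutes, so they overlap
    public static final TimeWindows SLIDING =
            TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)).advanceBy(Duration.ofMinutes(5));

    // Session: group records that arrive within thirty minutes of one another
    public static final SessionWindows SESSION =
            SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30));
}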

Stream processing architecture

Let’s now explore how a Kafka broker and streaming application handle streams.

A Kafka stream (a KStream) represents an unbounded sequence of key-value pairs called records. Each record in the stream corresponds to a message from a topic, with the message’s key and value serving as the record’s content. In practice, the records originate from a Kafka topic that is subdivided into partitions. Kafka guarantees message ordering within a partition, which means a source stream can only be processed in parallel to the degree that its topic is partitioned.

Kafka Streams breaks down each topology into tasks based on the number of partitions for the input Kafka topics. Tasks can be executed in parallel, either by different instances of your application or different threads within an application. Roughly speaking, the number of tasks equals the number of partitions in the input topics.

Kafka Streams application internal architecture

Scaling a Kafka Streams application means either increasing the number of instances or increasing the size of the thread pools the Kafka Streams library manages. If you need to increase your application’s throughput, you must first increase the number of partitions in your source topic since the number of parallel tasks is directly linked to the number of partitions.

Within a topology, internal topics are used for processing in two cases:

  • Stateful operations: When an application needs a store to track stateful operations, Kafka Streams uses an internal changelog topic to persist changes. The state needed for processing is stored in a local state store managed by the Kafka Streams instance. Each state store is linked to a changelog topic, which records updates by sending a message to the topic whenever the state is modified.
  • Repartitioning data: Operations like groupBy or join require repartitioning the data. To ensure accurate aggregation (via count, aggregate, or reduce), all records with the same key (such as a group key or join key) must be placed in the same partition. This is especially necessary if the records were originally in separate partitions, ensuring that the follow-up operations yield valid results.

These internal topics guarantee that your streaming application is fault-tolerant. When an instance crashes or restarts, the local store is restored to its previous state by consuming the changelog topic and replaying the recorded operations. At an internal repartition topic, the topology is subdivided into sub-topologies processed by dedicated tasks, which allows aggregation operations after groupBy to be parallelized and distributed across multiple instances or threads.

Additionally, changelog topics allow you to define standby replicas. Standby replicas are backup tasks running on other instances; they don’t process data but remain ready in case the primary task fails. They consume the changelog topic to maintain an up-to-date copy of the local store. If the instance hosting the active task fails, the task is reassigned to the standby replica, which can resume processing almost immediately.
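
Enabling standby replicas only requires a single configuration property. A minimal sketch (the value is an example; it defaults to zero):

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class StandbyReplicaConfigExample {

    public static Properties standbyConfig() {
        Properties props = new Properties();
        // Keep one warm copy of each state store on another instance so that
        // failover doesn't have to replay the full changelog topic
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return props;
    }
}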

Kafka Streams internal topics architecture

Developing stream processing applications

Kafka Streams is a standalone framework, and all you really need to create a streaming application is org.apache.kafka:kafka-streams. But it’s better to use this library in conjunction with frameworks like Spring or Quarkus, as they provide integrations that handle the bootstrapping of the streaming application. This allows you to focus solely on defining the topologies by combining processors.
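
For reference, here is a minimal sketch of bootstrapping a Kafka Streams application without Spring or Quarkus (topic names and the application ID are placeholders); frameworks essentially automate these steps for you:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsApplication {

    public static void main(String[] args) {
        // Minimal configuration: application.id and bootstrap.servers are required
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "search-analytics-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Build the topology by registering processors on a StreamsBuilder
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream("input-topic");
        stream.to("output-topic"); // placeholder pass-through topology

        // Start the streaming application and close it cleanly on shutdown
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}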

The following sections use an example application to illustrate how to use the framework. This example uses the Kafka Streams framework in combination with Spring Boot.

Stateless stream processing

As mentioned, stateless processing is ideal for preprocessing data. You can process messages one at a time, perform a transformation, and publish the result to one or more sinks.

Let’s walk through an example. Imagine you want to analyze search queries on an e-commerce website. The ideal data stream consists of a set of keywords you can derive metrics from, but user input may not always be in the desired format. Each query should be a valid set of nonrepeating English words with no special characters.

One possible approach to this task would be to first use the map() processor to clean up each event by removing special characters, handling potential typos, and eliminating duplicate words within the query. Then, you might apply the filter() processor to remove empty queries and send the processed stream to the sink topic for further analysis or storage. Here’s what this would look like:

public class KeywordSearchSearchTopology {

    public static final String SOURCE = "raw-search-queries";
    public static final String SINK = "search-queries";

    public void register(StreamsBuilder builder) {
        // 1. Consume search events from the input topic
        KStream<String, SearchEvent> searchStream = builder.stream(SOURCE,
                Consumed.with(Serdes.String(), Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(SearchEvent.class))));

        // 2. Preprocess search events
        KStream<String, SearchEvent> preprocessedStream = searchStream
                // Clean up the data
                .map((key, value) -> {
                    String query = value.getQuery();

                    // a) Remove special characters and extra white space
                    query = query.replaceAll("[^a-zA-Z0-9\\s]", "").trim().replaceAll("\\s+", " ");

                    // b) Replace or discard unrecognized words (potential typos)
                    // ... (some magic dictionary or AI function)

                    // c) Remove duplicate words within the same query
                    List<String> uniqueWords = Arrays.stream(query.toLowerCase().split("\\s+"))
                            .distinct()
                            .collect(Collectors.toList());

                    value.setQuery(String.join(" ", uniqueWords));

                    return KeyValue.pair(key, value);
                })
                // Remove empty queries
                .filter((key, value) -> !Objects.equals(value.getQuery(), "")); 

        // 3. Publish the new cleansed stream
        preprocessedStream.to(SINK,
                Produced.with(
                        Serdes.String(),
                        Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(SearchEvent.class))
                )
        );
    }
}

StreamsBuilder is a high-level Kafka Streams object that defines a topology. First, .stream(SOURCE, ...) defines the source stream, which specifies that it should read data from the Kafka topic raw-search-queries (defined in the SOURCE constant). Consumed.with(...) specifies how to deserialize the key and value of the incoming messages; in this case, the key is a string and the message is a JSON object. .to(SINK, ...) writes the data to the Kafka topic search-queries (defined in the SINK constant) and Produced.with(...) configures how to serialize the key and value of the outgoing messages.

In this case, Kafka Streams simplifies building a pipeline that cleans data and prepares it for analytics.

Stateful stream processing and time-based processing

After cleaning the user data and transforming it into the desired format using a stateless processor, you can move on to analyzing it.

Let’s aggregate the data to find the ranking of the most popular keywords during a given period of time. This operation involves working with stateful processors and time-based processors.

To achieve this with a chain of processors, you can use flatMapValues() to create a new stream where each event represents a single word. Then, you can apply groupBy() to group the keywords by their value and use the windowedBy() processor to organize the messages into ten-second windows. Finally, you can apply the count() processor to calculate the number of occurrences of each keyword within each time window. The following code demonstrates this approach:

public class KeywordSearchSearchTopology {

    public static final String SOURCE = "raw-search-queries";
    public static final String SINK = "search-queries";

    public static final String SINK_ANALYTICS = "search-query-analytics";
    public static final TimeWindows ANALYTICS_WINDOWS = TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(10));

    public static final String TRENDING_KEYWORD_STORE = "trending-keyword-store";

    public void register(StreamsBuilder builder) {

        // [Previous streams and processors]
        // 1. Consume search events from the input topic
        // 2. Preprocess search events
        // 3. Publish the new cleansed stream

        // 4. Extract keywords from search queries
        KStream<String, String> keywordStream = preprocessedStream
                .flatMapValues(searchEvent -> Arrays.asList(searchEvent.getQuery().toLowerCase().split("\\s+")));

        // 5. Identify trending keywords using a tumbling window
        KStream<Windowed<String>, Long> trendingKeywords = keywordStream
                .groupBy((key, keyword) -> keyword, Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(ANALYTICS_WINDOWS)
                .count(Materialized.as(TRENDING_KEYWORD_STORE))
                .toStream()
                .filter((windowedKeyword, count) -> count >= 5); // Threshold for trending keywords

        // 6. Publish analytics to a new stream
        trendingKeywords.to(SINK_ANALYTICS, Produced.with(
                new WindowedSerdes.TimeWindowedSerde<>(Serdes.String(), ANALYTICS_WINDOWS.size()), // window size matches ANALYTICS_WINDOWS
                Serdes.Long()));
    }
}

As you can see, the code for this operation is compact, but there’s a lot happening behind the scenes. Kafka Streams handles repartitioning, windowing, and the use of a local store. The best part is that none of this complexity needs to be manually managed because the Kafka Streams framework takes care of everything. To create a store, you only need to provide a name, such as the TRENDING_KEYWORD_STORE variable used here. By default, a persistent RocksDB-backed local store is created (an in-memory store can be configured instead), with a changelog topic backing it to ensure fault tolerance and data consistency.

Stateful processors open a new range of applications by exposing their local store through the interactive queries API. Interactive queries provide a mechanism for external applications to directly access the state managed by your Kafka Streams application; for example, this would allow other applications to query your processing application to access the current value of TRENDING_KEYWORD_STORE.
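
As a hedged sketch, the snippet below assumes a running KafkaStreams instance and queries the windowed store created by the analytics topology above for the counts recorded over the last hour:

import java.time.Duration;
import java.time.Instant;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class TrendingKeywordQuery {

    private final KafkaStreams streams;

    public TrendingKeywordQuery(KafkaStreams streams) {
        this.streams = streams;
    }

    // Print the windowed counts for a keyword over the last hour
    public void printRecentCounts(String keyword) {
        ReadOnlyWindowStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        KeywordSearchSearchTopology.TRENDING_KEYWORD_STORE,
                        QueryableStoreTypes.windowStore()));

        Instant now = Instant.now();
        try (WindowStoreIterator<Long> iterator = store.fetch(keyword, now.minus(Duration.ofHours(1)), now)) {
            iterator.forEachRemaining(entry ->
                    System.out.printf("window start %d -> count %d%n", entry.key, entry.value));
        }
    }
}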

Unit and integration tests

The previous example demonstrated how to implement a Kafka Streams topology that processed search queries and turned them into analytics. But how can you test this topology and make sure the streaming logic is correct?

The best option for fast unit tests when using Kafka Streams is to rely on TopologyTestDriver. This test helper emulates a Kafka broker, thus eliminating the overhead of running a Kafka cluster for your unit tests. It still allows for testing complex use cases with multiple sources and sinks. Let’s illustrate this by processing a set of search queries and validating the result.

All unit tests for Kafka Streams follow a similar structure and are straightforward to implement. Before each test, you instantiate TopologyTestDriver, register your topology, and instantiate the required source and sink topics:

public class KeywordSearchSearchTopologyTests {

    private TopologyTestDriver testDriver;
    private TestInputTopic<String, SearchEvent> inputTopic;
    private TestOutputTopic<String, SearchEvent> outputTopic;
    private TestOutputTopic<Windowed<String>, Long> analyticsTopic;

    @BeforeEach
    public void setup() {
        // 1. Set up the topology
        StreamsBuilder builder = new StreamsBuilder();
        new KeywordSearchSearchTopology().register(builder);
        testDriver = new TopologyTestDriver(builder.build(), new Properties());

        // 2. Set up the input and output topics for the preprocessing part
        inputTopic = testDriver.createInputTopic(
                KeywordSearchSearchTopology.SOURCE,
                new StringSerializer(),
                new JsonSerializer<>());
        outputTopic = testDriver.createOutputTopic(
                KeywordSearchSearchTopology.SINK,
                new StringDeserializer(),
                new JsonDeserializer<>(SearchEvent.class));

        // 3. Set up the windowed analytics output topic
        analyticsTopic = testDriver.createOutputTopic(
                KeywordSearchSearchTopology.SINK_ANALYTICS,
                new TimeWindowedDeserializer<>(new StringDeserializer(), KeywordSearchSearchTopology.ANALYTICS_WINDOWS.size()),
                new LongDeserializer());
    }

    // Test cases [...]
}

Once setup() is implemented, you can create individual tests in three steps. The following code first sends a list of events (searchEvents) to the input topic, then consumes the records from the analytics output topic, and finally compares the produced messages (trendingKeywords) with the list of expected messages (expectedResults):

    @Test
    public void testTrendingKeywords() {
        // 1. Define test data
        List<SearchEvent> searchEvents = Arrays.asList(
                // [...]
        );
        // Define the expected results with windowed keys and counts
        List<KeyValue<Windowed<String>, Long>> expectedResults = Arrays.asList(
                // [...]
        );

        // 2. Pipe input data to the input topic
        for (SearchEvent event : searchEvents) {
            inputTopic.pipeInput("key", event, Instant.now().toEpochMilli());
        }

        // 3. Verify trending keywords in the output topic
        List<KeyValue<Windowed<String>, Long>> trendingKeywords = analyticsTopic.readKeyValuesToList();
        assertIterableEquals(expectedResults, trendingKeywords);
    }
}

Using the unit test pattern illustrated above, you should be able to fully cover your code to validate your topologies.

You can push your test further and implement an integration test against a real Kafka cluster instead of TopologyTestDriver. To add an integration test, you could use the embedded Kafka broker, which is practical for many use cases. Alternatively, you can bootstrap a Kafka cluster yourself using Docker Compose or Testcontainers. While the embedded Kafka broker is convenient, the Docker setup offers the added benefit of interacting with the underlying Kafka cluster using the Kafka CLI to produce or consume messages.
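
As a rough sketch, a Testcontainers-based integration test could look like the following, assuming the Testcontainers kafka module and JUnit 5 are on the classpath (the image tag and application ID are arbitrary):

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class KeywordSearchSearchTopologyIT {

    private static final KafkaContainer KAFKA =
            new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.5.0"));

    private static KafkaStreams streams;

    @BeforeAll
    static void startKafkaAndStreams() {
        KAFKA.start();

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "keyword-search-it");
        // Point the streaming application at the containerized broker
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, KAFKA.getBootstrapServers());

        StreamsBuilder builder = new StreamsBuilder();
        new KeywordSearchSearchTopology().register(builder);
        streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }

    @AfterAll
    static void stopKafkaAndStreams() {
        streams.close();
        KAFKA.stop();
    }

    // Tests would produce to the SOURCE topic and consume from the SINK_ANALYTICS topic
    // with regular Kafka clients pointed at KAFKA.getBootstrapServers()
}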

Custom processors

Kafka Streams allows you to define custom processors to adjust the processing logic in your streaming application. Using a custom processor makes your code more modular and easier to maintain.

To create one, implement the Processor interface and override its process() method, which defines the operation to perform on each input record:

public class StringReverserProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> record) {
        // 1. Reverse the string
        String reversedString = new StringBuilder(record.value()).reverse().toString();

        // 2. Forward the reversed string with the same key
        context.forward(record.withValue(reversedString));
    }
}

This snippet defines a new processor with a process() method that handles records where both the key and value are strings, represented as Record<String, String>. The processor uses ProcessorContext to interact with the Kafka Streams framework, and calling .forward() sends the transformed record to the next processor in the topology graph.

You can use your newly created processor as follows:

public class StringReverserStreamsTopology {

    public static final String INPUT_TOPIC = "input-topic";
    public static final String OUTPUT_TOPIC = "output-topic";

    public void register(StreamsBuilder builder) {
        // 1. Consume from the input topic
        KStream<String, String> inputStream = builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()));

        // 2. Apply the StringReverserProcessor
        KStream<String, String> reversedStream = inputStream.process(StringReverserProcessor::new);

        // 3. Send the reversed strings to the output topic
        reversedStream.to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Serdes.String()));
    }
}

Custom processors allow you to define and apply custom logic wherever needed.

Error handling

Kafka Streams comes with a set of predefined error handlers. DeserializationExceptionHandler tackles invalid messages that couldn’t be deserialized, while ProductionExceptionHandler deals with errors that occur when writing records back to Kafka. (For uncaught exceptions thrown during processing, you can register a StreamsUncaughtExceptionHandler on the KafkaStreams instance.)

In most cases, you’ll choose between the preexisting implementations LogAndContinueExceptionHandler and LogAndFailExceptionHandler. A common use case for a custom error handler is implementing dead-lettering, which involves sending records that trigger exceptions to a quarantine topic. There, you can inspect them to understand the cause of the error. You can find an example implementation here.
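
Error handlers are registered through the streams configuration. A minimal sketch, assuming you want to log and skip records that can’t be deserialized:

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

public class ErrorHandlingConfigExample {

    public static Properties errorHandlingConfig() {
        Properties props = new Properties();
        // Log and skip records that cannot be deserialized instead of stopping the application
        props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
                LogAndContinueExceptionHandler.class);
        // A custom handler (e.g., a hypothetical DeadLetterExceptionHandler that forwards
        // bad records to a quarantine topic) would be registered the same way
        return props;
    }
}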

Operating Kafka Streams applications

To finish this tour of the Kafka Streams framework, let’s talk about deployment, monitoring, and scalability. These elements ensure that Kafka Streams applications operate effectively in production environments and can handle the demands of data processing while maintaining performance and reliability as the system grows.

Deploy and monitor performance

You can deploy Kafka Streams applications like any other application. For instance, if you choose to use Kubernetes, you’ll need to package your application’s JAR as a container within a Docker image.

Kafka Streams exposes numerous metrics through JMX, including consumed/processed records, error rates, latency, and state store sizes. These JMX metrics are crucial for monitoring your application’s performance. It’s equally important to monitor the Kafka broker itself to ensure that the application’s throughput is sufficient to process messages in a timely manner. The combination of Prometheus and Grafana is quite popular for scraping and displaying these metrics. You can even find existing dashboards created by the community.

The key metrics for streaming applications are throughput (the number of records processed by your application per unit of time) and latency (the time it takes for a single record to be processed by a topology).
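
If you prefer to read these metrics programmatically rather than through JMX, a small sketch could look like this, assuming a running KafkaStreams instance:

import org.apache.kafka.streams.KafkaStreams;

public class StreamsMetricsLogger {

    // Print thread-level metrics (the same values exposed through JMX),
    // including processing rate and latency
    public static void logThreadMetrics(KafkaStreams streams) {
        streams.metrics().forEach((name, metric) -> {
            if ("stream-thread-metrics".equals(name.group())) {
                System.out.printf("%s = %s%n", name.name(), metric.metricValue());
            }
        });
    }
}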

Adjust the application for scalability

Given that the throughput of a Kafka Streams application depends on the number of partitions in the source topic, consider increasing the partitions before scaling the application. When scaling, you have two options. First, you can increase the num.stream.threads configuration, which controls the number of processing threads per instance. Alternatively, you can deploy new instances of your application by increasing the number of replicas in your Kubernetes deployment. The design of your topology also greatly impacts the performance of your application. If numerous operations are chained, you might need to use the repartition() processor (the successor to the now-deprecated through()) to split your topology into sub-topologies whose tasks can be parallelized independently.
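
A minimal sketch of the threading configuration (the value is only an example):

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class ScalingConfigExample {

    public static Properties scalingConfig() {
        Properties props = new Properties();
        // Run two processing threads per application instance; total parallelism
        // is still capped by the number of input partitions
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
        return props;
    }
}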

In addition to scaling for improved throughput and latency, you also need to monitor the state stores your application uses. If you rely on in-memory stores, overprovision your application’s memory to avoid it being OOM killed. If a state store is large, with many entries (rows), consider the persistent store backed by RocksDB, which keeps data on disk. This is a trade-off: retrieving data from a disk-backed store is slower than from an in-memory store, but disk storage is much cheaper than memory.
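
If you want to pick the store type explicitly, you can pass a store supplier to Materialized instead of a plain name. A sketch reusing the store name and window size from the earlier example (the one-hour retention period is an assumption):

import java.time.Duration;

import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowBytesStoreSupplier;

public class StoreChoiceExample {

    // RocksDB-backed store: state is kept on disk, suitable for large stores
    public static final WindowBytesStoreSupplier ON_DISK =
            Stores.persistentWindowStore("trending-keyword-store",
                    Duration.ofHours(1),    // retention period
                    Duration.ofSeconds(10), // window size, matching ANALYTICS_WINDOWS
                    false);                 // don't retain duplicates

    // In-memory store: faster lookups, but bounded by the instance's memory
    public static final WindowBytesStoreSupplier IN_MEMORY =
            Stores.inMemoryWindowStore("trending-keyword-store",
                    Duration.ofHours(1),
                    Duration.ofSeconds(10),
                    false);

    // Usage: .count(Materialized.as(ON_DISK)) or .count(Materialized.as(IN_MEMORY))
}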

With performance monitoring and scalability adjustments, you can ensure your Kafka Streams applications run smoothly and efficiently in production. Kafka-based applications require close attention to the Kafka broker when optimizing for performance and throughput.

Conclusion

Kafka Streams is a framework for building robust stream processing applications within the Kafka ecosystem. Kafka Streams simplifies the design of streaming applications through its extensive set of out-of-the-box processors. These processors abstract away the complexities of managing state and partitions during aggregation operations. By assembling these existing building blocks, Kafka Streams allows you to build virtually any data pipeline you can imagine.

If you’re impressed by Kafka Streams’ capabilities, consider exploring Redpanda—a modern, Kafka-compatible streaming platform engineered for low latency and high throughput. Redpanda serves as a drop-in replacement for Kafka, streamlining stream processing development and deployment while offering enhanced performance. It provides access to the familiar Kafka Streams API and ecosystem while benefiting from a more efficient streaming infrastructure.

To explore Redpanda and how it can make your job easier, check the documentation and browse the Redpanda blog for use cases and tutorials. If you have questions or want to chat with the team, join the Redpanda Community on Slack.
