Understanding Apache Kafka

Kafka tutorial

“Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.”

- Kafka official documentation, https://kafka.apache.org/.

Apache Kafka® originated at LinkedIn, though it was soon open-sourced and became part of the Apache Software Foundation. Its core concept, a distributed log, was straightforward, which made it popular with many companies and led to a broad community growing around it.

Summary of Kafka features

Kafka is a powerful tool that relies on complex algorithms under the hood. Although newer solutions are emerging to compete with it, Kafka still boasts several desirable features.

| Kafka capability | Description |
| --- | --- |
| High throughput | Kafka can ingest massive batches of data with a throughput similar to that of the older batch processing approach. |
| High availability | Kafka has built-in mechanisms for customizing the replication factor and supports at-least-once or at-most-once delivery semantics out of the box. |
| Scalability | You can spin up an additional Kafka broker by simply specifying a ZooKeeper/KRaft quorum and starting a Kafka daemon. |
| Data persistence to disks | Kafka flushes data to disk and retains it for a specified time or up to a specified size. You can replay your application data from any point that is still retained. |
| Independent pipelines | You may implement different approaches for the same data stream: (1) store and forward the data, using Kafka as intermediate storage or a buffer; (2) fan-in, where data flows to the same topic from multiple sources; (3) fan-out, where the same data set is consumed by multiple applications. |
| Simplicity | Kafka brokers focus on doing one thing well, avoiding feature creep, encouraging excellent performance, and preventing overcomplication. |
| Maturity | Kafka was first released more than 12 years ago and has gathered a large community, with many Fortune 100 companies using it. |
| Open source | Kafka’s codebase is publicly available for anyone to audit, and you are free to fork the code and change it however you like. |
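The data persistence feature described above can be pictured as an append-only log with numbered offsets. The following is only a toy sketch, not Kafka's storage engine: `RetainedLog`, `max_records`, and `read_from` are hypothetical names, and a simple record count stands in for Kafka's time- or size-based retention.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedLog:
    """Toy append-only log; max_records stands in for a retention limit."""
    max_records: int
    base_offset: int = 0  # offset of the oldest record still retained
    records: list = field(default_factory=list)

    def append(self, value: bytes) -> int:
        """Append a record and return the offset it was assigned."""
        self.records.append(value)
        offset = self.base_offset + len(self.records) - 1
        # Enforce retention: drop the oldest records once the limit is exceeded.
        while len(self.records) > self.max_records:
            self.records.pop(0)
            self.base_offset += 1
        return offset

    def read_from(self, offset: int) -> list:
        """Replay every retained record starting at the given offset."""
        start = max(offset, self.base_offset) - self.base_offset
        return self.records[start:]


log = RetainedLog(max_records=3)
for value in [b"a", b"b", b"c", b"d"]:
    log.append(value)

# Offset 0 has aged out, so a replay "from the beginning" starts at offset 1.
print(log.read_from(0))
```

Because reading never removes anything, several independent readers can replay the same retained records, which is what makes the fan-out approach possible.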

Use-case scenarios

Kafka is helpful in various real-life operational and data analytics use cases.

  • Messaging: This domain has its own specialized software, like RabbitMQ and ActiveMQ, but Kafka is often sufficient to handle it while providing great performance.

  • Website activity tracking: Kafka can handle small, frequently generated data records like page views, user actions, and other web-based browsing activity.

  • Metrics: You can consolidate and aggregate operational metrics from many applications, using topics to keep the different kinds of data organized.

  • Log aggregation: Kafka makes it possible to gather logs from different sources and aggregate them in one place in a single format.

  • Stream processing: Streaming pipelines are one of the most important Kafka features, making it possible to process and transform data in transit.

  • Event-driven architecture: Applications can publish and react to events asynchronously, allowing events in one part of your system to easily trigger behavior somewhere else. For example, a customer purchasing an item in your store can trigger inventory updates, shipping notices, etc.

Architectural components overview

These are the most essential high-level components of Kafka:

  • Record

  • Producer

  • Consumer

  • Broker

  • Topic

  • Partitioning

  • Replication

  • ZooKeeper or Controller Quorum

That list can feel like a bag of jargon when you’re new to Kafka or streaming. To make things a little easier, let’s break down each of those concepts one by one.

Record

Also called an event or message, a record is a byte array that can store any object of any format. An example would be a JSON record describing what link a user clicked while they were on your website.
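As a small illustration of a record being just bytes, the sketch below encodes a JSON click event into a byte array and decodes it again. The event fields (`user_id`, `url`, `action`) and the helper names are made up for this example; Kafka itself does not prescribe any of them.

```python
import json

# Hypothetical click event; the field names are invented for illustration.
click_event = {"user_id": "u-123", "url": "/pricing", "action": "click"}


def serialize(event: dict) -> bytes:
    """Turn an event into the byte array that would be stored as a record."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Turn the stored bytes back into an event on the reading side."""
    return json.loads(payload.decode("utf-8"))


payload = serialize(click_event)
print(type(payload).__name__, len(payload))
```

The writer and the reader have to agree on the encoding, since the bytes carry no description of their own format.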

Sometimes you want to distribute certain kinds of events among a group of consumers, so each event will be distributed to just one of the consumers in that group. Kafka allows you to define consumer groups this way.

A critical design approach is that, besides consumer groups, no other interconnection happens among clients. Producers and consumers are fully decoupled and agnostic of each other.
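The consumer group behavior can be sketched with a toy delivery function: every group sees the full stream, but within a group each event goes to only one member. This is a simplification under stated assumptions. The group and member names are hypothetical, and the per-event rotation used here is not how Kafka does it; Kafka assigns whole partitions to group members.

```python
from collections import defaultdict


def deliver(events: list, groups: dict) -> dict:
    """Give each group every event, but each event to one member per group."""
    received = defaultdict(list)
    for index, event in enumerate(events):
        for members in groups.values():
            # Simple rotation for illustration, not Kafka's actual assignment.
            member = members[index % len(members)]
            received[member].append(event)
    return dict(received)


groups = {
    "analytics": ["analytics-1", "analytics-2"],  # two members share the work
    "billing": ["billing-1"],                     # one member gets everything
}
received = deliver(["e0", "e1", "e2", "e3"], groups)
for member, events in sorted(received.items()):
    print(member, events)
```

The two groups never coordinate with each other, which mirrors the point that clients are decoupled apart from group membership.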

Producer

A producer is a client application that publishes records (writes) to Kafka. An example here is a JavaScript snippet on a website that tracks browsing behavior on the site and sends it to the Kafka cluster.

Consumer

A consumer is a client application that subscribes to records from Kafka (i.e. reads them), such as an application that receives browsing data and loads it into a data platform for analysis.

Broker

A broker is a server that handles requests from producer and consumer clients and keeps the data replicated within the cluster. In other words, a broker is one of the machines (physical or virtual) that make up a Kafka cluster.

Topic

A topic is a category that allows you to organize messages. Producers send to a topic, while consumers subscribe to topics of relevance, so they only see the records they actually care about.
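To make the idea of topics as categories concrete, here is a minimal in-memory sketch. `TinyBroker`, `send`, and `poll` are hypothetical names for this illustration and are not part of any Kafka client API.

```python
from collections import defaultdict


class TinyBroker:
    """In-memory stand-in for a broker; keeps one list of records per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def send(self, topic: str, record: bytes) -> None:
        """Producer side: append a record to the named topic."""
        self.topics[topic].append(record)

    def poll(self, subscriptions: set) -> list:
        """Consumer side: return records from the subscribed topics only."""
        return [
            (topic, record)
            for topic in sorted(subscriptions)
            for record in self.topics.get(topic, [])
        ]


broker = TinyBroker()
broker.send("page-views", b"/home")
broker.send("payments", b"order-42")
broker.send("page-views", b"/pricing")

# A consumer subscribed only to page views never sees the payment record.
seen = broker.poll({"page-views"})
print(seen)
```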

Partitioning

Partitioning means breaking a topic log into multiple logs that can live on separate nodes on the Kafka cluster. This allows you to have topic logs that are too big to live on one single node.
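One common way to decide which partition a record lands in is to hash its key. The sketch below assumes keyed records and uses CRC32 purely as a stand-in hash (Kafka's default partitioner for keyed records uses murmur2); `choose_partition` and the sample keys are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index."""
    # CRC32 is a stand-in here; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # one log per partition

events = [
    (b"user-1", b"view /home"),
    (b"user-2", b"view /pricing"),
    (b"user-1", b"click buy"),
]
for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

for index, log in enumerate(partitions):
    print(index, log)
```

Records that share a key always hash to the same partition, so their relative order is preserved there even though the topic as a whole is spread over several logs.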

Replication

Partitions can be copied among several brokers to stay safe in case one broker experiences a failure. These copies are called replicas.
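The effect of replication can be sketched by placing each partition on several brokers and then removing one broker. The round-robin placement below is a simplification chosen for readability, not Kafka's actual assignment algorithm, and the function and broker names are hypothetical.

```python
def assign_replicas(num_partitions: int, brokers: list, replication_factor: int) -> dict:
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        partition: [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
        for partition in range(num_partitions)
    }


def surviving_replicas(assignment: dict, failed_broker: str) -> dict:
    """Show which replicas remain for each partition after one broker fails."""
    return {
        partition: [b for b in replicas if b != failed_broker]
        for partition, replicas in assignment.items()
    }


assignment = assign_replicas(3, ["broker-1", "broker-2", "broker-3"], 2)
remaining = surviving_replicas(assignment, "broker-2")
print(assignment)
print(remaining)
```

With a replication factor of 2, every partition still has at least one replica after a single broker failure.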

Ensemble service

An ensemble is a centralized service that maintains configuration information and provides discovery, distributed synchronization, and coordination. Kafka used to rely on Apache ZooKeeper for this, although newer versions replace it with KRaft, a Raft-based consensus protocol built into Kafka itself, in which a quorum of controller nodes manages the cluster metadata.

Not all event streaming software requires installing a separate ensemble service. Redpanda, which offers 100% Kafka-compatible data streaming, works out of the box because it already has this functionality built in.

Learn to deploy your first Kafka cluster

The best way to learn is to start up Kafka and play around with producers, consumers, and other features. The other chapters of this guide will give you plenty of concepts to experiment with, but first, you’ll need to have Kafka installed and running.

Let’s deploy a simple single-node Kafka cluster. We will use an Ubuntu 22.04 LTS virtual machine with one CPU and 2 GB RAM.

  1. Install the Java runtime environment, which is required to run Apache Kafka:

    sudo apt install default-jre

  2. Download your preferred version of the Kafka binaries, extract the archive (we use the /opt directory here), and change into the extracted directory:

    wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
    sudo tar -xzvf kafka_2.13-3.3.1.tgz -C /opt
    cd /opt/kafka_2.13-3.3.1/

  3. Start a ZooKeeper server with the default configuration in config/zookeeper.properties:

    sudo bin/zookeeper-server-start.sh config/zookeeper.properties &

  4. Start a Kafka server with the default configuration in config/server.properties:

    sudo bin/kafka-server-start.sh config/server.properties &

  5. We’re now running a single-node Kafka cluster! Let’s produce a couple of records. Type each record at the > prompt, then press Ctrl+C to exit the producer:

    bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic myfirsttopic
    >my first record in Kafka
    >my second record in Kafka

  6. Let’s consume the topic from the beginning:

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --group mygroup --topic myfirsttopic --from-beginning
    my first record in Kafka
    my second record in Kafka

Perfect, both records were successfully sent from the producer to the consumer!
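If the console tools cannot connect, a quick first check is whether anything is listening on the broker port. The sketch below is a generic TCP reachability check, assuming the broker uses the localhost:9092 address from the commands above; `broker_is_listening` is a hypothetical helper, and a successful connection only shows that the port is open, not that Kafka is healthy.

```python
import socket


def broker_is_listening(host: str = "localhost", port: int = 9092,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("broker reachable:", broker_is_listening())
```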

Kafka’s straightforward setup was a big contributor to its popularity. Redpanda takes it even further by only requiring that you download a single binary file and by giving you rpk, a command-line tool for easily managing your clusters. You don’t need to worry about setting up ZooKeeper, a KRaft quorum, HTTP Proxy, or Schema Registry. Those capabilities are all built into the Redpanda node!

Chapters

Kafka tutorial

Kafka makes it easy to stream and organize data between the applications that produce and consume events. However, using Kafka optimally requires some expert insights like the kind we share in this series of chapters on Kafka.

Kafka console producer

Kafka offers a versatile command line interface, including the ability to create a producer that sends data via the console.

Kafka console consumer

Kafka makes it easy to consume data using the console. We’ll guide you through using this tool and show you how it is used in real-world applications.

Kafka without ZooKeeper

New changes are coming that allow engineers to use Kafka without relying on ZooKeeper. Learn all about how KRaft makes ZooKeeper-less Kafka possible in this article.

Kafka partition strategy

Learn how to select the optimal partition strategy for your use case, and understand the pros and cons of different Kafka partitioning strategies.

Kafka consumer config

Consumers are a basic element of Kafka. But to get the most out of Kafka, you’ll want to understand how to optimally configure consumers and avoid common pitfalls.

Kafka schema registry

Figuring out the format used by a producer can be quite a chore. Luckily, the Kafka ecosystem offers a schema registry to give us an easy way to identify and use the format specified by the producer.

Streaming ETL

ETL presents a variety of challenges for data engineers, and adding real-time data into the mix only complicates the situation further. In this article, we will help you understand how streaming ETL works, when to use it, and how to get the most out of it.

RabbitMQ vs. Kafka

In the world of distributed messaging, RabbitMQ and Kafka are two of the most popular options available. But which one is the better choice for your organization? Read on to find out in this head-to-head comparison.

Kafka cheat sheet

Kafka is a powerful tool, but navigating its command line interface can be daunting, especially for new users. This cheat sheet will guide you through the most fundamental commands and help you understand how they work.

ETL pipeline

Learn how to build a near real-time streaming ETL pipeline with Apache Kafka and avoid common mistakes.

What is Kafka Connect?

Learn how to build and run data pipelines between Apache Kafka and other data systems with Kafka Connect, including configuring workers, connectors, tasks, and transformations.