Understanding Apache Kafka

“Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.”

- Kafka official documentation, https://kafka.apache.org/.

Apache Kafka® was created at LinkedIn, but it quickly became open source and part of the Apache Software Foundation. Its core concept, a distributed log, was straightforward, which made it beloved by many companies and led to a broad community growing around it.

Summary of Kafka features

Kafka is a powerful tool that relies on complex algorithms under the hood. Although newer solutions are emerging to compete with Kafka, it still boasts several desirable features.

  • High throughput: Kafka can ingest massive volumes of data continuously, with throughput comparable to that of older batch processing approaches.
  • High availability: Kafka has built-in mechanisms for customizing the replication factor, along with at-least-once or at-most-once delivery semantics out of the box.
  • Scalability: You can spin up an additional Kafka broker by simply pointing it at the ZooKeeper/KRaft quorum and starting the Kafka daemon.
  • Data persistence to disk: Kafka flushes data to disk and retains it for a specified time or size limit, so you can replay your application data from any point.
  • Independent pipelines: You can implement different approaches for the same data stream:
    - Store and forward the data, using Kafka as intermediate storage or a buffer.
    - Implement a fan-in approach, where data flows into the same topic from multiple sources.
    - Implement a fan-out approach, where the same data set is consumed by multiple applications.
  • Simplicity: Kafka brokers focus on doing one thing well, avoiding feature creep, encouraging excellent performance, and preventing overcomplication.
  • Maturity: Kafka was first released more than 12 years ago and has gathered a large community, with many Fortune 100 companies using it.
  • Open source: Kafka’s codebase is publicly available for anyone to audit, and you are free to fork the code and change it however you like.
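
To make the data persistence point concrete, here is a minimal sketch of how retention can be tuned per topic, assuming a broker reachable at localhost:9092 and a topic named myfirsttopic (both placeholders until the walkthrough at the end of this chapter):

# Keep records for 7 days (604800000 ms) or up to 1 GiB per partition, whichever limit triggers cleanup first
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name myfirsttopic --alter --add-config retention.ms=604800000,retention.bytes=1073741824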

Use-case scenarios

Kafka is helpful in various real-life operational and data analytics use cases.

  • Messaging: This domain has its own specialized software, like RabbitMQ and ActiveMQ, but Kafka is often sufficient to handle it while providing great performance.
  • Website activity tracking: Kafka can handle small, frequently generated data records like page views, user actions, and other web-based browsing activity.
  • Metrics: Kafka makes it easy to consolidate metrics from distributed applications and organize them into topics for aggregation and monitoring.
  • Log aggregation: Kafka makes it possible to gather logs from different sources and aggregate them in one place in a single format.
  • Stream processing: Streaming pipelines are one of the most important Kafka features, making it possible to process and transform data in transit.
  • Event-driven architecture: Applications can publish and react to events asynchronously, allowing events in one part of your system to easily trigger behavior somewhere else. For example, a customer purchasing an item in your store can trigger inventory updates, shipping notices, etc.

Architectural components overview

These are the most essential high-level components of Kafka:

  • Record
  • Producer
  • Consumer
  • Broker
  • Topic
  • Partitioning
  • Replication
  • Ensemble service (ZooKeeper or controller quorum)

That list can feel like a bag of jargon when you’re new to Kafka or streaming. To make things a little easier, let’s break down each of those concepts one by one.

Record

Also called an event or message, a record is a byte array that can store any object of any format. An example would be a JSON record describing what link a user clicked while they were on your website.
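
As a quick sketch, you could publish such a click event from the command line with the console producer that ships with Kafka. The topic name clicks, the key user42, and the | separator are assumptions made for illustration:

# parse.key splits each input line into a key and a JSON value at the | character
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic clicks --property parse.key=true --property "key.separator=|"
>user42|{"page": "/pricing", "link": "/signup"}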

Sometimes you want to distribute certain kinds of events among a group of consumers, so each event will be distributed to just one of the consumers in that group. Kafka allows you to define consumer groups this way.
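
For example, running the console consumer below in two terminals with the same (hypothetical) group name puts both consumers in one group; each partition of the topic is assigned to exactly one member, so every record is processed only once within the group. Note that the topic needs at least two partitions for both consumers to receive data:

# Run in two terminals: both consumers join the group click-processors and split the partitions between them
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic clicks --group click-processors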

A critical design approach is that, besides consumer groups, no other interconnection happens among clients. Producers and consumers are fully decoupled and agnostic of each other.

Producer

A producer is a client application that publishes records (writes) to Kafka. An example here is a JavaScript snippet on a website that tracks browsing behavior on the site and sends it to the Kafka cluster.

Consumer

A consumer is a client application that subscribes to records from Kafka (i.e. reads them), such as an application that receives browsing data and loads it into a data platform for analysis.

Broker

A broker is a server that handles requests from producer and consumer clients and keeps data replicated within the cluster. In other words, a broker is one of the machines, physical or virtual, that a Kafka cluster runs on.

Topic

A topic is a category that allows you to organize messages. Producers send to a topic, while consumers subscribe to topics of relevance, so they only see the records they actually care about.
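
Creating a topic is a one-liner with the kafka-topics.sh tool that ships with Kafka; the topic name clicks and the broker address here are placeholders for illustration:

# Create a topic named "clicks" with default partition and replication settings
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic clicks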

Partitioning

Partitioning means breaking a topic’s log into multiple logs, each of which can live on a separate node in the Kafka cluster. This lets a topic hold more data than fits on any single node and allows multiple consumers to read it in parallel.
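
The partition count can be set when the topic is created (via the --partitions flag) or increased later. A sketch, again using the hypothetical clicks topic:

# Increase the topic to three partitions (partition counts can only grow, never shrink)
bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --topic clicks --partitions 3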

Replication

Partitions can be copied across several brokers so that data remains safe if one broker fails. These copies are called replicas.
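
On a multi-broker cluster you could, for instance, create a topic with three replicas of each partition and then inspect where they ended up; this sketch assumes at least three brokers are running:

# Each of the three partitions will have one leader and two follower replicas
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic clicks-replicated --partitions 3 --replication-factor 3
# Shows the leader, replica list, and in-sync replicas (ISR) for every partition
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic clicks-replicated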

Ensemble service

An ensemble is a centralized service for maintaining configuration information, handling discovery, and providing distributed synchronization and coordination. Kafka long relied on Apache ZooKeeper for this; newer versions have moved to KRaft, a built-in Raft-based controller quorum that removes the external dependency.
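
For reference, starting a single broker in KRaft mode (no ZooKeeper at all) looks roughly like this in Kafka 3.3 and later, using the sample config/kraft/server.properties file shipped in the distribution:

# Generate a cluster ID and format the storage directory before first start
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
# The broker and the Raft controller now run in the same process
bin/kafka-server-start.sh config/kraft/server.properties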

Not all event streaming software requires installing a separate ensemble service. Redpanda, which offers 100% Kafka-compatible data streaming, works out of the box because it already has this functionality built-in.

Learn to deploy your first Kafka cluster

The best way to learn is to start up Kafka and play around with producers, consumers, and other features. The other chapters of this guide will give you plenty of concepts to experiment with, but first, you’ll need to have Kafka installed and running.

Let’s deploy a simple single-node Kafka cluster. We will use an Ubuntu 22.04 LTS virtual machine with one CPU and 2 GB of RAM.

1. Install the Java runtime environment, which is required to run Apache Kafka:

sudo apt install default-jre
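
You can confirm the runtime is available before moving on:

java -version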

2. Download your preferred version of the Kafka binaries, extract the archive (we will extract it to the /opt directory), and change the current working directory:

wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
sudo tar -xzvf kafka_2.13-3.3.1.tgz -C /opt
cd /opt/kafka_2.13-3.3.1/

3. Start a ZooKeeper server with the default configuration file config/zookeeper.properties:

sudo bin/zookeeper-server-start.sh config/zookeeper.properties &

4. Start a Kafka server with the default configuration file config/server.properties:

sudo bin/kafka-server-start.sh config/server.properties &
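
To verify that the broker is up, you can ask it to list its topics (the list will be empty on a fresh cluster):

bin/kafka-topics.sh --bootstrap-server localhost:9092 --list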

5. We’re now running a single-node Kafka cluster! We can begin producing and consuming records:

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic myfirsttopic
>my first record in Kafka
>my second record in Kafka

6. Let’s consume the topic from the beginning:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --group mygroup --topic myfirsttopic --from-beginning
my first record in Kafka
my second record in Kafka

Perfect, both records were successfully sent from the producer to the consumer!
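
As a final check, you can inspect the consumer group we just used; kafka-consumer-groups.sh reports the current offset and lag for each partition:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group mygroup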

Kafka’s straightforward setup was a big contributor to its popularity. Redpanda takes it even further by only requiring that you download a single binary file, and it gives you rpk, a command-line tool for easily managing your clusters. You don’t need to worry about setting up ZooKeeper, a KRaft quorum, HTTP Proxy, or Schema Registry: those capabilities are all built into the Redpanda node! You can learn more about Redpanda's single binary architecture and its benefits here.
