
Introduction #
A growing number of companies are shifting their workload to cloud services to take advantage of their high scalability, security, availability, and performance. Amazon Web Services (AWS) and Google Cloud Platform (GCP), for example, can handle almost any use case, including object storage.
Object storage is a cost-effective data storage type that treats data as objects. It stores large amounts of unstructured data, generally bundled with metadata and a unique identifier. APIs provide easier data access and retrieval.
Amazon Simple Storage Service (S3) and Google Cloud Storage (GCS) offer features such as versioning, replication, security or retention policies, and object lock. Redpanda integrates well with both, and it is designed to be easy to install so that you can get streaming up and running quickly.
In this tutorial, you will learn how to:
- Create an S3 and a GCS bucket
- Set up and run a Redpanda cluster, and create topics for Kafka Connect usage
- Configure and run a Kafka Connect cluster for Redpanda and both cloud storage providers, and use it to stream into their buckets
You can find the code to complete this demo in this GitHub repository.
Prerequisites #
You’ll need the following:
- An AWS account. You can create one here if you haven’t already.
- A GCP account. If you don’t have one, create a Google account and sign in.
- A machine to install Redpanda and Kafka Connect.
- Java 11 to run the producer application.
This tutorial uses a Linux system, but you can use any operating system or container services to install Redpanda. Check out Redpanda’s quick start documentation here for details.
Scenario: Streaming bookstore inventory records #
Here we’ve created a fictitious scenario to demonstrate how you can use Kafka Connect with Redpanda to feed data into S3 and GCS. This scenario is for demonstration purposes only, and does not necessarily reflect a typical use case.
Suppose that you work for a bookstore company called PandaBooks LLC as an integration engineer. It has two branches, one in London and the other in New York. The company has a main inventory file in CSV format that’s updated daily by the book providers, and the company manually splits and distributes the relevant inventory data to the branches.
This process has many drawbacks, though. It’s vulnerable to human error. The split CSV files for the branch inventories can’t be versioned, and no history is available, so data that’s lost can’t be recovered. Additionally, there are no file or storage standards.
The company decides to automate the process, distributing the inventories automatically and storing them in cloud environments in a standard format. Your job is to create a Kafka Connect cluster that you’ll configure to use a Redpanda cluster.
The London branch already uses AWS for their other daily processes such as keeping the order records in S3, and using Simple Email Service (SES) for customer emails. Because choosing a service that they already use lowers the costs for them, you must integrate the Kafka Connect cluster with S3 for the London inventory.
The New York branch is bigger and they have loads of customer data, such as payments, book borrowings, and online book orders. This branch uses GCP to include machine learning in their processes by using their data pool. Their cloud provider of choice will be GCP for any further implementation, so you must integrate the Kafka Connect cluster with GCS for the New York inventory.
The developers have created an inventory-distributor application that produces the book data from the inventory file to relevant Kafka topics for bookstore branches. You must create the Kafka topics for each branch and then configure and create Kafka Connect connectors for each branch to consume the book data and save it to the relevant cloud storage.
The below image demonstrates what’s needed:

Setting up Redpanda #
Run the following commands to install Redpanda:
On Fedora/Red Hat systems:
Run the setup script to download and install the repo:
Use yum to install Redpanda:
On Debian/Ubuntu systems:
Run the setup script to download and install the repo:
Use apt to install Redpanda:
Start the Redpanda cluster with the following command:
Verify that it’s up and running:
You should see this output:
Keep in mind that this is a single-node Redpanda cluster and not suitable for a production environment. To install Redpanda in a production environment, check the production documentation.
To enable the inventory-distributor application and the Kafka Connect cluster to work properly, you must define the Kafka topics for the London and New York inventories. You can use the Redpanda command-line interface (CLI) to create topics on the Redpanda cluster. Access the CLI by running the rpk command.
Run the following command to create a topic for the London branch inventory:
Then, run this command for the New York branch inventory:
Verify that you have created the topics:
You will see the following output:
You don’t need to specify a partition count or replication factor for the topics because this isn’t a production environment.
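If you would rather script the topic creation, the two rpk calls can be generated from a list of topic names. The following is a minimal Python sketch, not part of the original tutorial: the helper name topic_create_commands is hypothetical, and the code only builds and prints the commands. Running them still requires rpk to be installed and the Redpanda cluster to be up.

```python
import shlex

# Topic names used in this tutorial, one per branch.
BRANCH_TOPICS = ["london-inventory", "newyork-inventory"]


def topic_create_commands(topics):
    """Build one `rpk topic create <name>` argument list per topic (hypothetical helper)."""
    return [["rpk", "topic", "create", topic] for topic in topics]


for argv in topic_create_commands(BRANCH_TOPICS):
    # Print the shell form of each command. To execute it instead, pass argv to
    # subprocess.run(argv, check=True) on a machine where rpk is available.
    print(shlex.join(argv))
```

No partition or replication flags are added, which matches the defaults used in this tutorial.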
Setting up the cloud storage providers #
Next, you’ll set up Amazon S3 and Google Cloud Storage.
Amazon S3
Log in to your AWS account and search for S3 in the search bar at the top. Click the S3 link and then the Create bucket button. Name the bucket pandabooks-london and select a region (eu-west-2 for London).
Scroll down to Bucket Versioning and enable it. Leave the other configurations as is and click Create bucket at the bottom of the page. If done correctly, you’ll see the following:

Next, create an AWS access key and secret for your account. Click your username on the top right and click the Security credentials link in the dropdown menu.
On the opened page, click the Access keys (access key ID and secret access key) section and then click the Create Access Key button. Save the access key ID and the secret key.
Google Cloud Storage (GCS)
Log in to your GCP account and search for storage in the search bar at the top. Click the Cloud Storage link. On the opened page, click the Create Bucket button and enter the name pandabooks-newyork.
Keep clicking Continue, leaving the rest of the data as is, until the section Choose how to protect object data. Select the Object versioning (best for data recovery) option to enable bucket versioning. Click the Create button.
You should see the following:

Next, create a service account and an access key. Search for service accounts in the search bar at the top and click the Service Accounts link.
On the opened page, click the + Create Service Account button. Name the account gcs-storage-account and click the Create and Continue button. Select the Storage Admin role and click the Done button at the bottom.

On the redirected Service accounts page, click the three dots to open the Actions menu. Click the Manage keys link.
On the page that opens, click the Add Key menu button and then click Create new key. Click Create on the pop-up page. This will trigger a download for the credentials JSON file. Copy the file to your home directory and rename it to google_credentials.json.
Setting up Kafka Connect #
Kafka Connect is an integration tool released with the Apache Kafka® project. It’s scalable and flexible, and it provides reliable data streaming between Apache Kafka and external systems. You can use it to integrate with many kinds of systems, including databases, search indexes, and cloud storage providers. Because Redpanda is fully compatible with the Kafka API, Kafka Connect can use a Redpanda cluster in place of a Kafka cluster.
Kafka Connect uses source and sink connectors for integration. Source connectors stream data from an external system to Kafka, while sink connectors stream from Kafka to an external system.

You’ll need to download the Apache Kafka package to get Kafka Connect. Navigate to the Apache downloads page for Kafka and click the suggested download link for the Kafka 3.1.0 binary package.
Create a folder called pandabooks_integration in your home directory and extract the Kafka binaries file to this directory. You can use the following commands by changing the paths if necessary:
Configuring the Connect cluster
To run a Kafka Connect cluster, you’ll need to configure a file in the properties format.
In pandabooks_integration, create a folder called configuration.
Create a connect.properties file in this directory with the following content:
Set the bootstrap.servers value to localhost:9092. This configures the Connect cluster to use the Redpanda cluster.
Next, configure plugin.path for the connector binaries. Create a folder called plugins in the pandabooks_integration directory and put the connector binaries in the plugins directory.
To download the S3 sink connector, navigate to the Aiven S3 sink connector for Apache Kafka download page and click the download link for v2.12.0. Use the following commands, changing the paths if necessary:
The final folder structure for pandabooks_integration should look like this:
Change the plugin.path value to /home/_YOUR_USER_NAME_/pandabooks_integration/plugins. This tells the Connect cluster where to find the connector binaries.
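Before starting the cluster, it can help to confirm that the two settings changed above are actually present in connect.properties. The following is a minimal Python sketch, not part of the original tutorial. It assumes the file uses plain key=value lines, and the function names and the sample text are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_worker_config(props):
    """Return a list of problems with the two settings this tutorial changes."""
    problems = []
    if props.get("bootstrap.servers") != "localhost:9092":
        problems.append("bootstrap.servers should be localhost:9092")
    if not props.get("plugin.path"):
        problems.append("plugin.path is not set")
    return problems


# Hypothetical sample text; for a real check, read your own file with
# pathlib.Path("connect.properties").read_text().
sample = """
# worker settings
bootstrap.servers=localhost:9092
plugin.path=/home/_YOUR_USER_NAME_/pandabooks_integration/plugins
"""

print(check_worker_config(parse_properties(sample)))  # prints []
```

An empty list means both settings are in place. The parser ignores the more advanced properties syntax (line continuations, colon separators), which this tutorial does not need.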
The final connect.properties file should look like this:
Configuring the connectors
Installing the connector plugins in a Kafka Connect cluster isn’t enough on its own, because the cluster also needs a configuration for each connector it runs. You’ll need to configure the sink connectors for S3 and GCS.
To configure the S3 sink connector, create a file called s3-sink-connector.properties in the ~/pandabooks_integration/configuration directory with the following content:
Some of the values are already configured, but some are left blank. Set the following values for the keys in the s3-sink-connector.properties file:
The London branch uses Amazon S3 for book inventory, so the above configuration specifies topics as london-inventory and bucket name as pandabooks-london.
To configure the GCS sink connector, create a file called gcs-sink-connector.properties in the ~/pandabooks_integration/configuration directory with the following content:
Again, some of the values are already configured, but some are left blank. Set the following values for the keys in the gcs-sink-connector.properties file:
The New York branch uses GCS for book inventory, so the above configuration specifies topics as newyork-inventory and bucket name as pandabooks-newyork.
For both connectors, the setting file.name.template={{key}}.json controls the name of the file saved to cloud storage, so each file is named after the Kafka message key that the producer sets. The inventory-distributor application is pre-configured to use the ISBN of each book as the key.
You also set the format.output.fields key to the value key,value, so the files saved to cloud storage contain both the key and the value of the Kafka message.
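To make the naming rule concrete, the following is a minimal Python sketch, not part of the original tutorial, of how a {{key}}.json template turns a message key into an object name and how a key,value field list selects what is written. It only illustrates the idea. It does not reproduce how the connectors actually serialize records, and the function names, the offset field, and the placeholder payload are hypothetical.

```python
def object_name(template, key):
    """Fill the {{key}} placeholder with the Kafka message key."""
    return template.replace("{{key}}", key)


def output_fields(record, fields):
    """Keep only the record fields named in a comma-separated setting."""
    wanted = [name.strip() for name in fields.split(",")]
    return {name: record[name] for name in wanted}


# Hypothetical record: the value is a placeholder, not real inventory data.
record = {"key": "978-0134052502", "value": "<book data>", "offset": 0}

print(object_name("{{key}}.json", record["key"]))  # 978-0134052502.json
print(output_fields(record, "key,value"))
```

Because the object name depends only on the key, two messages with the same ISBN always map to the same object name.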
Running the Kafka Connect cluster
To run the cluster with the configurations that you applied, open a new terminal window and navigate to the ~/pandabooks_integration/configuration directory. Run the following command:
If done correctly, the output will look like this:
Note that you’re running the Kafka Connect cluster in standalone mode. Kafka Connect also offers a distributed mode, but standalone is a better fit for this architecture. For more on distributed mode, check Kafka’s documentation.
Running the inventory distributor #
Download the inventory-distributor application binary from this link and save it in the ~/pandabooks_integration directory.
Download the main inventory file book-inventory.csv to the ~/pandabooks_integration directory with the following command:
The book-inventory.csv file contains six book records to be sent to the London and New York inventories, each tagged as either london or new york. The inventory-distributor application uses these tags to sort the records.
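The tag-based sorting can be pictured as a lookup from branch tag to topic name. The following is a minimal Python sketch, not the actual inventory-distributor code. The tags and topic names come from this tutorial, while the CSV column names (isbn, title, branch), the function name, and the placeholder title are assumptions made for illustration.

```python
import csv
import io

# Branch tag -> Kafka topic, using the tags and topic names from this tutorial.
TOPIC_BY_TAG = {"london": "london-inventory", "new york": "newyork-inventory"}


def route_records(csv_text):
    """Yield (topic, key, row) for each CSV row, using the ISBN as the message key."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        tag = row["branch"].strip().lower()
        yield TOPIC_BY_TAG[tag], row["isbn"], row


# Hypothetical file layout; the real book-inventory.csv columns may differ.
sample = """isbn,title,branch
978-0134052502,Example Title A,london
978-0553213690,The Metamorphosis,new york
"""

for topic, key, _row in route_records(sample):
    print(topic, key)
```

A real producer would send each row to the returned topic with the ISBN as the message key instead of printing it.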
Run the application with the following command in a new terminal window and keep it running:
The output should be as follows:
The inventory-distributor application continuously scans the book-inventory.csv file for changes and sends them to the relevant Kafka topic for each inventory. The Kafka Connect connectors consume the book data from these topics and save it to Amazon S3 for the London inventory and GCS for the New York inventory.
In your web browser, navigate to your S3 bucket pandabooks-london to verify that the objects were created. You’ll see three objects, one for each book in the London inventory.

Click one of the object names and click the Download button to verify the file data. For example, the contents of the 978-0134052502.json object should be as follows:
Now, do the same for your GCS bucket pandabooks-newyork.

This time, though, the book data in 978-0553213690.json is wrong. The Metamorphosis is in English, but the providers set it as German:
Open the book-inventory.csv file in the ~/pandabooks_integration directory and replace the German field with English, then save. Be sure that the inventory-distributor application is still running.
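The edit is picked up because the application keeps rescanning the file for changes. One way such change detection could work is to compare two snapshots of the file keyed by ISBN and resend only the rows that differ. The following is a minimal Python sketch of that idea, not the application's actual implementation. The function name and the record fields are hypothetical.

```python
def changed_records(previous, current):
    """Return records that are new or whose fields differ, keyed by ISBN."""
    return {
        isbn: record
        for isbn, record in current.items()
        if previous.get(isbn) != record
    }


# Hypothetical snapshots of the inventory file before and after the edit.
before = {
    "978-0553213690": {"language": "German"},
    "978-0134052502": {"language": "English"},
}
after = {
    "978-0553213690": {"language": "English"},  # the corrected row
    "978-0134052502": {"language": "English"},  # unchanged, so not resent
}

print(changed_records(before, after))
# {'978-0553213690': {'language': 'English'}}
```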
In your web browser, download the 978-0553213690.json object again to verify that it has been updated.
It should have the following content:
Because you configured the connectors to name each file after the message key, which is the book’s ISBN, the object names don’t change. S3 and GCS update the object data and keep the old version because you enabled versioning.
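The effect of versioning on repeated writes to the same object name can be modeled with a small in-memory stand-in. The following Python sketch is a toy illustration only: the class and its methods are hypothetical and are not the S3 or GCS API, and the stored strings are placeholders for the real object contents.

```python
from collections import defaultdict


class VersionedBucket:
    """Toy in-memory stand-in for a bucket with versioning enabled."""

    def __init__(self):
        self._versions = defaultdict(list)

    def put(self, name, data):
        # Writing to an existing name appends a new version and keeps the old ones.
        self._versions[name].append(data)

    def get(self, name):
        # A plain read returns the most recent version.
        return self._versions[name][-1]

    def history(self, name):
        # All stored versions, oldest first.
        return list(self._versions[name])


bucket = VersionedBucket()
bucket.put("978-0553213690.json", "language: German")
bucket.put("978-0553213690.json", "language: English")

print(bucket.get("978-0553213690.json"))           # language: English
print(len(bucket.history("978-0553213690.json")))  # 2
```

The object name stays the same across both writes, and the earlier German version remains available in the history.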
Now, add a book record for the London inventory by appending the following line to the book-inventory.csv file:
It should appear in the inventory:

The content of the object should be the following:
Conclusion #
Congratulations! You’ve accomplished the bookstore’s project requirements. You’ve created storage buckets for your London and New York inventories, created a Redpanda cluster and topics for connector configurations, and created a Kafka Connect cluster to use Redpanda and stream book data to your inventory buckets.
By using Kafka Connect and Redpanda, you can help a variety of businesses to better manage and store their vital data. This improves your workflow as well as business operations.
Remember, you can find the code for this tutorial in this GitHub repository. Join Redpanda's Slack community to share what you build with Redpanda.