Learn how to build a modern clickstream analytics system using Redpanda and SingleStore

By Manish Kumar and Chris Larsen on June 20, 2023
Unleashing the power of real-time analytics with SingleStore and Redpanda

The ability to use real-time data to make business decisions is critical in today’s world. In the modern business landscape, data has become the new oil, fueling growth and innovation across industries. Yet, as crucial as data is, its true value lies not only in the volume accumulated but in the speed at which we can process and analyze it effectively and efficiently to drive business outcomes.

The sheer magnitude of data generated every day requires tools that facilitate high-speed transactions, real-time analytics, and robust data streaming. Recognizing these requirements, two platforms stand out: SingleStoreDB and Redpanda. While each is powerful in its own right, integrating the two can revolutionize your data management strategy by bringing together the best of data streaming and real-time analytics.

Let's take clickstream data as an example. Real-time clickstream analysis is a pervasive challenge in today's world of online interactions. It involves tracking and analyzing the sequence of clicks that a user makes while navigating a website or application, which is leveraged for improving user experiences, personalizing content, and optimizing marketing strategies.

The high volume, velocity, and variety of clickstream data make real-time processing especially challenging. Users typically use an event streaming platform like Kafka to process the clickstream, but have historically struggled with the end database on which they need to run complex analysis.

Building such a real-time analytics system requires both ingesting large amounts of events and serving the analytical needs of the application as quickly as possible. Such integrations involve stitching together multiple tools and custom solutions, which makes them hard to manage and scale. This is where Redpanda’s integration with SingleStoreDB provides a best-in-class option for customers building such applications.

What's Redpanda?

If you're new here, Redpanda is a modern streaming platform that acts as a drop-in replacement for Apache Kafka®. Kafka was once a streaming data superpower, but it struggles with modern data-intensive requirements, sparking a need for leaner, faster streaming data alternatives.

Redpanda is designed to offer higher performance with a simpler, more developer-friendly architecture than Kafka. It's fully compatible with the Kafka API, so it suits applications like this one and lets existing Kafka applications run without any code changes.
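As a rough illustration of what Kafka API compatibility means in practice, here is a minimal Python sketch. It assumes the kafka-python client library (not mentioned in this post) and reuses the SASL settings that appear in the pipeline definition later on. The function names, broker address, and topic name are hypothetical.

```python
import json


def redpanda_client_config(bootstrap_servers, username, password):
    """Keyword arguments for kafka-python's KafkaProducer (an assumed client library).

    The same settings a Kafka cluster would need; nothing here is Redpanda-specific.
    """
    return {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "SCRAM-SHA-256",
        "sasl_plain_username": username,
        "sasl_plain_password": password,
    }


def publish_events(producer, topic, events):
    """Send each event as UTF-8 JSON and return how many were sent.

    `producer` is any object with send(topic, value=...) and flush(),
    which is the shape of kafka-python's KafkaProducer.
    """
    count = 0
    for event in events:
        producer.send(topic, value=json.dumps(event).encode("utf-8"))
        count += 1
    producer.flush()
    return count
```

With kafka-python installed, `KafkaProducer(**redpanda_client_config("<broker>:9092", "<user_name>", "<password>"))` could be passed in as `producer`. The point of the sketch is that the client code is the same whether the broker is Kafka or Redpanda.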

This makes it easy to integrate with SingleStore pipelines, which can ingest massive amounts of data into SingleStore and make it queryable in real time. Additionally, this can be done in a simple SQL-like interface, which makes using these configurations seamless for developers.

What's SingleStore?

SingleStore is a distributed SQL database that unifies transactions and analytics in a single engine to drive low-latency access to large datasets, simplifying the development of fast, modern applications. Built for developers and architects, SingleStoreDB is based on a distributed SQL architecture, delivering 10-100 millisecond performance on complex queries, while ensuring businesses can effortlessly scale.

In this blog post, we will show how to set up a Redpanda cluster that receives clickstream data from a variety of sources. This data can then be ingested into a SingleStoreDB cluster running on AWS to provide rich insights in real time.

The challenges of building a modern clickstream analytics system

Building a modern clickstream analytics system involves collecting, processing, analyzing, and storing large amounts of data in real time. Let's take the example of a user making multiple clicks online. All of that clickstream data needs to be processed to drive business value or predictive analytics. At a high level, it involves four steps:

  1. Collect data from a variety of sources
  2. Ingest data
  3. Store the data
  4. Drive business insights by consuming the data

All of these operations need to happen in real time while ensuring the architecture is robust for enterprise readiness, scalable based on business needs, and at a reasonable cost.
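To make the "collect" step concrete, here is a minimal Python sketch of what a single clickstream event might look like before it is sent to a topic. The field names (`event_id`, `user_id`, `page_url`, `element_id`, `event_ts`) are assumptions for illustration, not a schema defined in this post.

```python
import json
import time
import uuid


def make_click_event(user_id, page_url, element_id, ts=None):
    """Build one click event as a plain dict (hypothetical field names)."""
    return {
        "event_id": str(uuid.uuid4()),  # unique key, useful for deduplication
        "user_id": user_id,
        "page_url": page_url,
        "element_id": element_id,
        "event_ts": ts if ts is not None else time.time(),
    }


def encode_event(event):
    """Serialize the event to compact UTF-8 JSON, ready to be published."""
    return json.dumps(event, separators=(",", ":")).encode("utf-8")


event = make_click_event("user-42", "/pricing", "signup-button")
payload = encode_event(event)
```

A unique event ID is included because the pipeline shown later uses an upsert clause, which needs a key to detect duplicates.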

While real-time stream processing systems such as Apache Flink and Apache Spark can provide low latency, quick ingestion into a database, and quick processing, Redpanda's integration with SingleStore provides the best parts of both tools.

Typical databases struggle with the speed of ingestion and have to rely on external tools. However, SingleStore supports a native capability called Pipelines, which enables very fast ingestion from Kafka.

Build a real-time analytics system in three steps

Now we'll show a simple setup for ingesting a large amount of clickstream data from Redpanda into SingleStoreDB Cloud running on AWS.


Clickstream Kafka pipelines to SingleStore

Redpanda is easy to deploy in the cloud using one of two options: Dedicated Cloud (provisioned in Redpanda’s tenant, AWS in this case) or Bring Your Own Cloud (BYOC - provisioned in your tenant yet still fully managed with Redpanda’s unified control plane). The solution in this tutorial was built using Redpanda’s BYOC model.

To build the connection with SingleStoreDB Cloud, customers can set up a Kafka pipeline. Now that you have a SingleStoreDB cluster running, we'll create a pipeline that can capture the incoming stream of data natively into SingleStore. All you need is three steps:

1 - Set up the actual pipeline using SingleStore Kafka pipelines.

CREATE OR REPLACE PIPELINE `<Pipeline_name>` AS
LOAD DATA KAFKA 'Redpanda_topic_1, Redpanda_topic_2, Redpanda_topic_3'
CONFIG '{
  "sasl.username": "<user_name>",
  "sasl.mechanism": "SCRAM-SHA-256",
  "security.protocol": "SASL_SSL"
}'
CREDENTIALS '{ "sasl.password": "REDACTED" }'
DISABLE OUT_OF_ORDER OPTIMIZATION
INTO TABLE <table_name>
FORMAT JSON (
  -- map each table column (left) to a key in the JSON message (right)
  field_1 <- value_1,
  field_2 <- value_2,
  field_3 <- value_3,
  field_4 <- value_4,
  field_5 <- value_5
)
-- on a key collision, overwrite each column with the incoming value
ON DUPLICATE KEY UPDATE
  field_1 = VALUES(field_1),
  field_2 = VALUES(field_2),
  field_3 = VALUES(field_3),
  field_4 = VALUES(field_4),
  field_5 = VALUES(field_5);

2 - Once the pipeline is created, you can check it by testing it against sample data.

> TEST PIPELINE `<Pipeline_name>`;

3 - Once you have verified with the sample data that the pipeline works, you can start the pipeline, and data should start flowing.

> START PIPELINE `<Pipeline_name>`;

Just like that, you can start getting data into SingleStore, where it is instantly available for queries and provides quick insights.
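To build intuition for the `field <- value` mapping and the ON DUPLICATE KEY UPDATE clause in step 1, here is a small in-memory Python model. It is only a sketch of the semantics, not how SingleStore implements pipelines. The column names and JSON keys are hypothetical.

```python
import json

# Hypothetical column-to-JSON-key mapping, in the spirit of `field_1 <- value_1`.
COLUMN_MAP = {"event_id": "id", "user_id": "user", "page_url": "url"}


def map_message(raw, column_map=COLUMN_MAP):
    """Pick the mapped JSON keys out of one message and name them as columns."""
    doc = json.loads(raw)
    return {column: doc.get(key) for column, key in column_map.items()}


def upsert(table, row, key="event_id"):
    """Insert the row, or update the existing row when the key is already present."""
    if row[key] in table:
        table[row[key]].update(row)  # the ON DUPLICATE KEY UPDATE branch
    else:
        table[row[key]] = dict(row)  # plain insert


table = {}
messages = [
    '{"id": "e1", "user": "u1", "url": "/home"}',
    '{"id": "e2", "user": "u2", "url": "/pricing"}',
    '{"id": "e1", "user": "u1", "url": "/checkout"}',  # same key as the first message
]
for message in messages:
    upsert(table, map_message(message))

print(len(table), table["e1"]["page_url"])  # prints: 2 /checkout
```

Three messages produce two rows, because the third message reuses the key of the first and overwrites its columns instead of creating a duplicate.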

Empower your real-time data strategy

The combination of SingleStore and Redpanda offers a best-in-class solution for organizations seeking real-time analytics and high-speed data processing. By harnessing the power of these platforms, businesses can stay ahead in today’s data-driven landscape. In this step-by-step blog, we demonstrated how to set up the connection between Redpanda and SingleStoreDB running on AWS.

Redpanda provides a real-time ingestion platform that can be a drop-in replacement for Kafka. SingleStore provides a unique capability to serve both the transactional and analytical needs of your application, and makes data available for query as soon as it's loaded. This makes building real-time applications simple and efficient, and makes SingleStore a strong choice for storing real-time data streamed from Redpanda.

Interested in trying out SingleStore and Redpanda? Learn how to get started building with SingleStoreDB and try Redpanda for free. You can also browse the Redpanda blog for step-by-step tutorials and real-world customer stories. If you have questions or just want to chat with fellow Redpanda users, join the Redpanda Community on Slack.
