Modern Kappa architecture: Redpanda, Materialize, and dbt

Realizing a faster, more developer-friendly, and safer Kappa architecture.

Patrick Angeles

February 18, 2022

CopIED!

Introduction

Ever since their ascent to dominance in the 80s, relational databases have been used in business applications to record transactions, store information on business objects (entities), and process and analyze historical data to inform business decisions. In this post, we will discuss established database patterns, and how they translate to a new streaming architecture that results in faster insights for modern business applications.

Operational vs analytical workloads

Database workloads are typically categorized as either operational or analytical. Operational workloads are about capturing and serving data instantaneously. For example, recording a transaction or updating a user profile. They tend to be write-heavy with random access patterns. Analytical workloads, on the other hand, are about processing and aggregating large amounts of historical data with the goal of answering ad-hoc queries. For example, a financial analyst might want to know a company’s “total sales per quarter over the last 3 years.” Analytical databases tend to be read-heavy, with high levels of concurrency.

Some applications could do all of this in a single database, typically using write-optimized (3NF) tables for operational data (transactions and entities), and denormalized tables—typically in a snowflake or star schema—for analytical data. Data would be incrementally transformed in batches from the operational to analytical tables with a SQL script, which could be invoked externally or via triggers and stored procedures. Several databases support materialized views, which is a convenient generalized implementation of this technique.

As applications grow in complexity and purpose, the single database design gets overwhelmed by the volume of data to store, process, and serve. At some point, we hit the physical limits of what a single database can do, and so the operational and analytical workloads are typically separated and placed on multiple databases.

Operational and analytical workloads are substantially different, and modern databases have evolved such that they are optimized for one type of workload. Operational databases tend to be row-oriented with transactional guarantees. These databases are designed for data correctness, durability, write concurrency, and random access patterns. Analytical databases, in turn, tend to be column-oriented and highly parallel. They optimize for concurrent range scans and aggregations typically required by reporting, dashboards, and business intelligence tools.

What is the ETL pattern?

To bridge the gap between operational and analytical databases, a common design pattern emerged. Applications would store and retrieve transaction and entity data using an operational database. On a periodic basis (typically once per day), the data would then be extracted, transformed, and loaded (ETL’d) into a data warehouse for use in reporting, dashboards, and business intelligence tools.

This architecture, while decades old, remains pervasive to this day. It allows for work to be designated to a specific database optimized to handle the load. However, this architecture also has its shortcomings, the most significant of which is data latency. The ETL process runs infrequently and often takes several hours to complete. Under this architecture, the resulting data for analysis is typically not available until the following day. This has the direct effect of delaying business decisions and dampening the ability to react to events as they happen.

Getting to real-time data streaming

In 2014, Martin Kleppman, author of the book Designing Data-Intensive Applications gave a seminal talk entitled Turning the database inside out. He outlined a new architectural pattern for processing data in real-time. The key insight is to externalize the write-ahead-log (WAL), which up to this point had been an internal (yet fundamental) component of every operational database. A streaming framework could then be used to process the log in near real-time, thereby providing an always-up-to-date equivalent to a materialized view.

This concept also extends to the ETL pattern. In a way, tables in a data warehouse are just materialized views that happen to live in another database. Instead of an ETL job processing on a daily cadence, you could have a streaming pipeline that updates your analytical views continuously as new data arrives.

Modernizing Kappa architecture with Redpanda, Materialize, and dbt

In recent years, Kleppman’s inside-out database concept has gained popularity, and even acquired a name: Kappa architecture. In his talk, Kleppman outlined a reference implementation using Apache Samza, a Java based streaming framework, and Apache Kafka® as the externalized WAL. Today, we believe we have an alternative implementation for the Kappa Architecture that is significantly faster, more developer friendly, has better safety guarantees, and provides better stewardship of the data. This new stack uses Redpanda for the WAL, Materialize for streaming SQL, and dbt for versioning and stewardship. Below, we discuss each piece of this stack in detail.

Redpanda: The fast, safe, and simple distributed log

Redpanda is a Raft-based event streaming platform with a distributed log written in C++ designed to take advantage of modern hardware. It runs at 10-40x lower latencies and 6x higher transaction throughput than Kafka, while maintaining API compatibility.

Redpanda does not use the page cache and guarantees full durability of logs upon acknowledgment of receipt. Raft-based consensus makes Redpanda fault-tolerant, performing well even with partial environmental failures. It is subjected to extensive chaos testing and is currently undergoing third-party certification to ensure correctness in the face of failure conditions. Redpanda has virtually infinite data retention with Tiered Storage. High throughput, low latency, fault tolerance, correctness, and infinite retention combine to make Redpanda a viable system of record, and the best choice for an externalized WAL.

Materialize: The streaming SQL engine

Materialize is a database purpose-built for streaming analytics. It incrementally updates query results (defined using SQL materialized views) as new data arrives, without requiring manual refreshes. The original reference implementation of the Kappa Architecture used Apache Samza, and since then multiple other streaming frameworks have popped up which require imperative programming skills in languages like Java, Scala, or Python. However, SQL is the lingua franca for batch data processing and analytics, so it makes sense to use it for streaming as well. Through its use of ANSI-standard SQL, Materialize puts the power of stream processing back into the hands of data engineers and analysts. Materialize is also wire-compatible with Postgres and can work with its broad ecosystem of tools and integrations.

Users can easily create Materialize sources and sinks that read from and write to Redpanda topics. Together, Redpanda and Materialize make a fast, easy-to-use implementation of the Kappa Architecture. And, as great as that may seem, the combination gets even better when we add dbt to the stack.

dbt: The data transformation framework

dbt is a tool quite unlike any other. In some respects, dbt does for data what Terraform does for building cloud infrastructure, or what Maven/Gradle/etc do for building code. dbt focuses on allowing data teams to work like software engineers, allowing them to incorporate best practices like version control, testing, and CI/CD to their analytics code. With dbt, users have a DRY approach to defining and executing their transformation logic. The logic is expressed in dbt models, which are written in SQL with the ability to supplement with Jinja.

Using functions like refs, dbt also automatically handles dependencies, making it easy to execute in different environments (e.g. development, production, or staging) and based on dependencies. It supports a number of data platforms like Postgres and Snowflake. dbt sits on top of these data platforms and executes SQL queries against them, pushing down the compute/storage.

Materialize is another supported execution environment, thanks to its wire compatibility with Postgres. With a standard database backend, dbt can perform periodic refreshes through the use of incremental models. Effectively this is like running mini-ETL jobs to update data. With dbt and Materialize, you can define your logic in dbt for Materialize to execute continuous real-time updates. No matter how frequently your data arrives, your models will stay up-to-date without manual or configured refreshes. As a transformation framework, dbt provides facilities for packaging, testing and documentation of models, pipelines, and schema. Especially when paired with a version control tool like Git, this combination gives data teams a powerful, self-documenting development stack for streaming data pipelines.

Get started today!

This is state-of-the-art Kappa architecture: Redpanda as a fast, durable log; Materialize for SQL-based streaming; and dbt for dataOps. This stack combines speed, ease of use, developer productivity, and governance. Best of all, you do not need to invest in setting up a large infrastructure: this entire stack can be packaged to run as a single Docker Compose project in your own laptop or workstation. You can try it out for yourself using this sample project.

For additional information or help with Redpanda, specifically, we encourage you to join our community Slack channel or download the binary from GitHub.