Redpanda is up to 6x more cost-effective than Apache Kafka, and 10x faster.
In this post, we will explore the overall costs of running both an Apache Kafka® and a Redpanda cluster for real-world data streaming use cases and throughputs in a self-hosted deployment model. We will define a cost model, test the physical characteristics of both systems using representative configurations, including security and disaster recovery (DR), and then evaluate the administrative costs for both systems.
Defining a cost model for data streaming
In this economic climate, costs are top of mind for everyone. It’s a good time to keep that money in your savings account rather than hand it to the cloud vendors! Total Cost of Ownership (TCO) should be a primary consideration when evaluating the Return on Investment (ROI) of adopting a new software platform. TCO is the blended cost of deploying, configuring, securing, productionizing, and operating the software over its expected lifetime, including all infrastructure, personnel, training, and subscription costs.
Specifically, for this comparison, we define TCO as a combination of the following components:
Infrastructure: The cost of computing and storage, in this case on AWS
Administration: The cost of deploying, installing, and maintaining clusters
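As a sketch, this two-component model reduces to simple arithmetic. The hourly rates, node counts, and salary figures below are illustrative placeholders, not numbers from our benchmarks:

```python
# Illustrative TCO sketch: blended infrastructure + administration cost.
# All rates, node counts, and salaries are placeholder assumptions,
# not figures from the benchmark report.

HOURS_PER_YEAR = 24 * 365

def annual_tco(instance_hourly_usd: float, node_count: int,
               admin_fte_fraction: float, loaded_salary_usd: float) -> float:
    """Annual infrastructure cost plus annual administration cost."""
    infrastructure = instance_hourly_usd * HOURS_PER_YEAR * node_count
    administration = admin_fte_fraction * loaded_salary_usd
    return infrastructure + administration

# Hypothetical: a 3-node cluster run by a shared ops team vs. a
# 12-node footprint (brokers plus coordination nodes) that needs a
# dedicated engineer's full attention.
small_cluster = annual_tco(0.45, 3, admin_fte_fraction=0.25,
                           loaded_salary_usd=200_000)
large_cluster = annual_tco(0.45, 12, admin_fte_fraction=1.0,
                           loaded_salary_usd=200_000)
```

The point of the model is that the administration term often rivals or exceeds the infrastructure term, which is why the rest of this post prices both.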
In order to establish an infrastructure cost comparison, we spent time running benchmarks to compare the performance of Kafka against Redpanda. By establishing infrastructure profiles with broadly similar performance characteristics, we were then able to calculate the cost differential between the two platforms.
Sizing for throughput at a given latency
Redpanda is built on the assumption that software should be able to make full use of the hardware on which it is deployed. Redpanda is designed to fully saturate fast SSD or NVMe devices, and take advantage of multi-core and high-memory machines.
We ran over 200 hours of tests across small, medium, and large workloads to generate a performance profile for both Kafka and Redpanda. You can find the full results in our detailed report.
When running our performance tests, we were looking for low and predictable end-to-end latency. We adjusted the node counts to ensure that latency remained relatively stable (i.e., the system was not overloaded, even at high throughput).
Evaluating our results (see our performance benchmarking detailed report linked above) yielded the following observations:
Redpanda’s average and P99+ end-to-end latency profiles remain incredibly consistent even at high throughputs.
Kafka could not handle workloads at 500MB/sec or above (1GB/sec total throughput) with 3 nodes. The tests could not complete at the required produce rate.
We had to repeatedly create bigger Kafka clusters to keep latency profiles flat, and even so, P99.9 latencies were above 200ms at 3x the cluster size of Redpanda.
For smaller workloads, Redpanda was able to run slightly faster on the cheaper AWS Graviton (ARM) CPUs, whereas Kafka was unable to operate on these instance types at any level of performance.
To build our cost model, we need to size our Redpanda and Kafka clusters to achieve comparable performance characteristics, our target being P99.9 end-to-end latency of less than 20ms. Our performance testing suggested the following sizing (in some cases we had to scale Kafka to 3x the node count of Redpanda just to get within 20x of its latency; see the performance benchmark linked above for more details).
Figure 01: Comparing infrastructure requirements across small, medium, and large workloads at a target latency profile.
One of the major benefits of running Redpanda is the simplicity of deployment. Because Redpanda is deployed as a single binary with no external dependencies, we do not need any infrastructure for ZooKeeper or for a Schema Registry. Redpanda also includes automatic partition and leader balancing capabilities so there’s no need to run Cruise Control.
In the cost model below, we’ve shown the infrastructure costs for running Redpanda across our small, medium, and large workloads. This includes the number of brokers needed to run the workload itself and, for Kafka, instances to run Cruise Control, Schema Registry (2 nodes for HA), and ZooKeeper (3 nodes).
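The difference in cluster footprint can be priced out directly. A minimal sketch, assuming illustrative on-demand hourly rates (the node counts follow the composition described above; the rates are placeholders, not quotes):

```python
# Sketch of annual infrastructure cost for each platform's footprint.
# Hourly instance rates here are illustrative assumptions, not quotes.

HOURS_PER_YEAR = 24 * 365

def annual_infra_cost(broker_hourly_usd: float, brokers: int,
                      ancillary_hourly_usd: float = 0.0,
                      ancillary_nodes: int = 0) -> float:
    """Brokers plus any ancillary infrastructure, priced per hour."""
    hourly = (broker_hourly_usd * brokers
              + ancillary_hourly_usd * ancillary_nodes)
    return hourly * HOURS_PER_YEAR

# Kafka at the large workload: 9 brokers plus ZooKeeper (3),
# Schema Registry (2 for HA), and Cruise Control (1) on smaller instances.
kafka_cost = annual_infra_cost(0.452, 9, ancillary_hourly_usd=0.10,
                               ancillary_nodes=6)

# Redpanda: a single binary with no external dependencies, so
# 3 brokers and nothing else.
redpanda_cost = annual_infra_cost(0.452, 3)
```

Even before the ancillary nodes are counted, the 3x difference in broker count dominates the comparison.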
Note: Apache Kafka added support for KRaft as a replacement for ZooKeeper in the 3.3 release; however, it is not yet feature-complete. We expect it to be a while before this feature is widely adopted and have not factored KRaft into our cost model.
Small Workload – 50MB/sec
For the small workload, we noticed that Redpanda and Kafka had a similar performance profile running on i3en.xlarge instances whereas Redpanda was able to show performance gains against Kafka on the smaller i3en.large instances. We did note, however, that we weren’t really able to fully utilize the i3en.large machines, simply because the workload was not large enough. By introducing AWS Graviton (ARM-based) instances, we were actually able to improve the performance of Redpanda at a significantly lower cost point. As discussed in more detail in the performance blog, Kafka was unable to run on the Graviton instances.
The cost comparison in this table compares Kafka running on i3en.xlarge against Redpanda running on is4gen.medium instances.
Figure 02: Infrastructure cost comparison for 50MB/sec workload between Kafka and Redpanda.
Compared to running Redpanda on AWS Graviton instances, Kafka comes in at 3-4 times more expensive. Kafka's performance on i3en.large instances was worse than Kafka on i3en.xlarge, and worse than Redpanda on either the same hardware or Graviton. Annual cost savings of up to $12,969 are available by using Redpanda for this workload.
Medium and Large Workloads – 500MB/sec and 1GB/sec
We saw similar results for the medium and large workloads: on identical hardware configurations (3 nodes), Kafka was unable to complete the workload at the required throughput, so we had to add nodes to get comparable results. For tail latency to be within 3x of Redpanda’s for the medium and large workloads, we needed to scale Kafka up to 9 nodes, which has a significant infrastructure cost impact.
The following tables compare Redpanda and Kafka with the requisite number of nodes to sustain the throughput at reasonable latency thresholds. All of these tests ran on i3en hardware.
Figure 03: Infrastructure cost comparison for 500MB/sec workload between Kafka and Redpanda.
Figure 04: Infrastructure cost comparison for 1GB/sec workload between Kafka and Redpanda.
On infrastructure costs alone you can expect to see cost savings of between $80K and $150K depending on the size and scale of your workload, which can represent a 3x cost saving against Kafka.
Redpanda is designed first and foremost for usability and simplicity (along with record-breaking performance and data safety). Because Redpanda does not need a JVM or ZooKeeper, we often hear from users who have been able to significantly reduce the amount of monitoring and tuning that is required for a Redpanda cluster compared to an equivalent Kafka cluster.
The following Redpanda features all contribute to a much lower administrative burden than Kafka:
Autotuner – Auto-detects the optimal settings for your hardware and tunes Redpanda to best take advantage of your specific deployment.
Leadership balancing – Improves cluster performance by ensuring that leadership is spread amongst nodes (and indeed amongst cores, so you don’t end up with multiple leaders hot-spotting specific cores).
Continuous Data Balancing (new in 22.2) – Automatically moves data off nodes that are running low on disk, or on node failure, to ensure that performance is maintained throughout the cluster.
Maintenance mode – Allows graceful decommissioning of nodes by transferring leadership onto other nodes ahead of a shutdown (for patching or disk maintenance).
Rolling upgrades – Upgrade the cluster without any interruption to consumers or producers.
Redpanda is also designed with data safety in mind as highlighted in the report from Jepsen. Improved data safety significantly reduces the operations and management overhead of running a Redpanda cluster and therefore reduces costs in this area.
The report highlights the key design differences between Redpanda and Kafka, specifically around weaknesses in Kafka’s ISR mechanism that can lead to data loss or unsafe leader election. Redpanda has no such weakness and as a result is much more stable under failure scenarios, including benefiting from having a single fault domain (compared to Kafka having ISR and ZooKeeper/KRaft as fault domains).
In building indicative cost comparisons for Redpanda against Kafka, we’ve drawn on what our customers tell us about how adopting Redpanda simplified their operations: they spend less time balancing partitions, tuning the JVM, ZooKeeper, and the operating system, and recovering from outages caused by ISR problems. We’ve made the following assumptions based on direct feedback from our customers:
Running a 3-node Redpanda cluster at small, medium, and large instance sizes does not increase the operational complexity and can be done by an ops team that may be managing other platforms at the same time.
Running a 9-node Kafka cluster, plus 3 ZooKeeper nodes at high throughputs is a significantly more complex undertaking, with outages and maintenance much more likely to require manual intervention on a regular basis.
Figure 05: SRE team cost comparison for 50MB/sec workload between Kafka and Redpanda.
Figure 06: SRE team cost comparison for 500MB/sec workload between Kafka and Redpanda.
Figure 07: SRE team cost comparison for 1GB/sec workload between Kafka and Redpanda.
Comparing the TCO of Redpanda vs Kafka
We see that even for small workloads, running Kafka can be 3x more expensive than running Redpanda, and for larger, more complex workloads this can rise to 5x or even higher.
In the consolidated cost model, we bring together the costs of hosting the primary cluster infrastructure and the administration costs as specified above. In this model, we do not include the cost of any DR site, or associated data transfer costs, although it’s a fair extrapolation to say that infrastructure costs alone will at least double, given the additional infrastructure required to host a MirrorMaker2 cluster on Kafka Connect (although for Redpanda Enterprise it is possible to use S3 replication; for further details, see our blog on HA deployment patterns).
Figure 08: Consolidated Total Cost of Ownership comparison of Kafka and Redpanda across all workloads.
All of the prices above compare Kafka with Redpanda Community edition. According to this model, savings in infrastructure and administrative costs can range from $76K for a small workload to $552K for large workloads, a multiplier of 6x.
Figure 09: Consolidated Total Cost of Ownership comparison of Kafka and Redpanda across all workloads.
Evaluating Additional Savings for Redpanda Enterprise
Redpanda Enterprise includes several features that can be leveraged to further reduce the TCO of a Redpanda cluster – even when compared against commercial Kafka offerings, including Redpanda Console and Redpanda’s tiered-storage capability.
Redpanda’s tiered-storage capability works by asynchronously publishing closed log segments to an S3-compatible object store such as AWS S3, GCS, Ceph, or MinIO, or a physical appliance such as Dell ECS, Pure Storage, or NetApp ONTAP. Redpanda’s tiered storage provides two additional capabilities: first, slow consumers can read old offsets seamlessly, with no client changes and at high throughput; second, you can create read-only topics on other clusters for analytics, machine learning, or disaster recovery purposes.
Kafka is working on tiered storage under KIP-405; however, this is not yet complete and has been ongoing for over 2 years. Some commercial vendors do have support for proprietary tiered storage offerings; however, these solutions do not offer read-only replicas, nor the ability to rebuild a cluster in a DR scenario and, therefore, if Kafka is to be used in a DR active/passive topology an additional Kafka Connect and MirrorMaker cluster would be required.
The largest cost savings come when a cluster must retain more data than would otherwise fit on the local disks of a cluster sized for the required throughput.
Figure 10: Per-day storage requirement for instance types at workload sustained throughput.
Because S3 storage is significantly cheaper than SSD/NVMe-based instances, it is advantageous to use tiered storage both to reduce cloud or infrastructure costs and to reduce the operational complexity of running a large Kafka cluster that is sized simply for retention.
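A back-of-the-envelope calculation shows why. The per-node NVMe capacity below is an assumption chosen for illustration; the arithmetic holds for any instance type:

```python
import math

# Back-of-the-envelope retention sizing: storage consumed by a
# sustained write rate, with replication, versus local NVMe capacity.
# The per-node capacity below is an illustrative assumption.

NVME_PER_NODE_BYTES = 2.5e12  # assume ~2.5 TB of local NVMe per node

def retention_bytes(throughput_mb_s: float, days: float,
                    replication_factor: int = 3) -> float:
    """Total cluster storage needed to retain `days` of data."""
    seconds = days * 24 * 60 * 60
    return throughput_mb_s * 1e6 * seconds * replication_factor

def nodes_for_retention(throughput_mb_s: float, days: float) -> int:
    """Node count if retention alone drives cluster size."""
    return math.ceil(retention_bytes(throughput_mb_s, days)
                     / NVME_PER_NODE_BYTES)

# The medium workload (500 MB/sec) retained for three days at RF=3
# needs ~389 TB of cluster storage. Without tiered storage, retention
# rather than throughput would dictate the size of the cluster.
three_day_bytes = retention_bytes(500, 3)
three_day_nodes = nodes_for_retention(500, 3)
```

With tiered storage, that retention tail moves to object storage at object-storage prices, and the cluster can stay sized for throughput alone.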
The following tables provide an illustrative comparison of running Redpanda Enterprise, Commercial Kafka (including tiered storage, noting the limitations above and that the Kafka cluster needs to be larger anyway for throughput), and both Redpanda Community and Kafka without tiered storage available. For each workload, we evaluate the potential infrastructure cost incurred at one, two, and three days’ worth of retention, relative to running Redpanda Enterprise with tiered storage enabled. This calculation gives us the value of Redpanda tiered storage over its open-source comparators.
Figure 11: Annual infrastructure cost comparison for three-day retention for 50MB/sec workload (comparing Kafka, Redpanda Community, Commercial Kafka, and Redpanda Enterprise).
Figure 12: Annual infrastructure cost comparison for three-day retention for 500MB/sec workload (comparing Kafka, Redpanda Community, Commercial Kafka, and Redpanda Enterprise).
Figure 13: Annual infrastructure cost comparison for three-day retention for 1GB/sec workload (comparing Kafka, Redpanda Community, Commercial Kafka, and Redpanda Enterprise).
In the tables above we can see that the incremental retention costs on clusters without tiered storage can be quite significant. The table below summarizes the results across all of the workloads:
Figure 14: Summary incremental value of Redpanda Enterprise over Kafka (infrastructure costs only).
We can see that the value of an enterprise subscription can range from $70K up to $1.2M or higher for bigger workloads or retention requirements. That is not accounting for the indirect values of Redpanda Enterprise features such as Redpanda Console with SSO and RBAC, remote read replicas, continuous data balancing, and hot-patching.
In this post, we have compared the TCO of running Kafka and Redpanda based on benchmarking that we have carried out on public cloud infrastructure.
Redpanda is between 3x and 6x more cost-effective than running the equivalent Kafka infrastructure and team, while still delivering superior performance.
And Redpanda Enterprise brings a number of features designed to make operating clusters easier, with Redpanda’s tiered storage delivering infrastructure savings of between $70K and $1.2M (roughly 8-9x) depending on the workload and the size of the cluster.
Overall, Redpanda is up to 6x more cost-effective to operate than Kafka, and Redpanda’s flexible deployment options mean it’s simple to deploy in Redpanda’s cloud, in your own cloud environment, or self-managed on-premises, on bare metal, or on Kubernetes.
Take Redpanda for a test drive here. Check out our documentation to understand the nuts and bolts of how the platform works, or read our other blogs to see the plethora of ways to integrate with Redpanda. To ask our Solution Architects and Core Engineers questions and interact with other Redpanda users, join the Redpanda Community on Slack.