Engineering Redpanda for multi-core hardware

Meet Noah Watkins and learn what it’s like to work on platforms engineered for multi-core systems.

Noah Watkins

May 18, 2022

Last modified on

CopIED!

I grew up in a small, rural town in Kansas, and studied systems programming close by at the University of Kansas. At the time, I was interested in staying in academics, and so I moved out West to join the Computer Science department at UC Santa Cruz, where I received a PhD studying the programmability of storage systems. It was during graduate school that I was first exposed to large, distributed systems like Ceph (also developed at UC Santa Cruz) and all of the other amazing work that emerged as the world began scaling systems for the cloud era. Ultimately, I decided to leave research behind, and I joined Red Hat working on a team that was supporting Ceph within Kubernetes.

If I’m being honest, quitting academics cold-turkey was tough. It left a large, nerd-snipe-shaped hole in my heart. I didn’t want to return to research, but I thought that getting involved with a greenfield project might scratch that itch. I was lucky enough to get that opportunity after I began chatting with Alex Gallego on Twitter. In August of 2019, I started working at Redpanda on core infrastructure. Today I live in San Francisco and, three years later, am still happily a part of the same Core team that develops the internals of the Redpanda streaming data platform.

In this post, I discuss some of the cool projects I’ve had the opportunity to collaborate on, the technologies we use on the Core team every day, and what we look for when hiring new developers.

Engineering with modern computer hardware in mind

At Redpanda we talk a lot about being able to take advantage of modern hardware. To some this may sound like a platitude, but in reality there are interesting engineering challenges that have emerged as hardware has advanced over the years. What does this mean?

Let’s take a step back and first understand where we have come from. Engineering a distributed storage system is a complex process and, when combined with a key goal of maintaining data safety over long periods of time, it produces systems that evolve slowly after they have been built, matured, and earned trust. A corollary to this is that the implementation and architecture of storage systems tend to reflect the hardware that was available when the system was created. And, for many existing systems, that means spinning hard disks, small numbers of CPU cores, slow networks, and limited memory capacity.

In contrast, the hardware landscape today has changed in dramatic ways. Let’s use storage device latency as an example. Today, NVMe devices can deliver latency that is orders of magnitude lower than the spinning disks of the past. Unfortunately, it is not sufficient to simply swap out slow disks for non-volatile memories and realize orders of magnitude difference in the performance of a storage system. To understand why, consider a storage system built using traditional techniques like threads, locks, and queues. These devices have a measurable cost compared to single-threaded performance — when threads spend time context switching, that is pure overhead compared to a thread using a CPU at 100% utilization exclusively for application-level instructions.

“But, wait”, you might say, “context switching, and the ability to use threads and locks, is a cost we all pay for the benefit of being able to write scalable, concurrent software”. That is true, and when a system assumes a slow spinning disk, these overheads don’t matter much. But in a system with super fast NVMe storage, the costs of context switching and interacting with heavy weight resources like locks, queues, and shared memory add up quickly and it becomes challenging to take advantage of the performance of the storage hardware.

Effectively, the CPU has become the bottleneck. Engineering Redpanda to eliminate this bottleneck — not just for storage but for other resources as well — is precisely what we mean when we talk about being able to take advantage of modern hardware. It’s about a fundamental and intentional way of building systems that differs from the past when the CPU was the star of the show.

Redpanda scales along with your hardware

A key enabling technology that allows Redpanda to scale from small machines up to the latest hardware (e.g. 96 cores and 1TB+ RAM) is a runtime framework called Seastar. This framework allows us to build server applications that are capable of delivering data directly off the network and through the entire Redpanda storage engine using a single thread of execution that is unencumbered by context switch overhead, heavy-weight concurrency control mechanisms, and difficult-to-manage resources like the kernel buffer cache. It is this ability for operations to “run to completion” (a great term borrowed from the Reflex paper) that is enabled by Seastar, and which provides Redpanda with the capability to fully utilize modern hardware.

Looking towards the future of Redpanda, I’m especially excited about the integration of computation into the storage system. Redpanda exposes this capability today through the use of WebAssembly (Wasm), allowing topics to be transformed with arbitrary code that runs safely within an isolated environment, close to the data. However, there is exciting work still to be done related to how we will manage resources in a way that balances the needs of computation with continuing to deliver low-latency I/O performance.

Developing Redpanda’s Kafka compatibility layer

One of my favorite projects at Redpanda has been the work we have done on the development of Redpanda’s Apache Kafka^Ⓡcompatibility layer. This layer translates the official Kafka network protocol onto the Redpanda storage engine. It’s an intricate sub-system, composed of over 100 code-generated network protocol parser functions, security features such as authentication and ACLs, consumer groups, and transactions.

It has been satisfying to work on this, in part, because of its size and complexity, but also because a large percentage of its behavior is well-defined and testable. The entire end-to-end experience of adding support for a new capability and testing it with off-the-shelf Kafka clients and existing applications has been especially rewarding from a technical point of view. There is something special about being able to go from protocol specification to a compliant implementation in Redpanda, and then observe any chosen Kafka client behave as expected.

Supporting the expansion of the Engineering team

Since starting at Redpanda in 2019, my role has evolved from consisting primarily of living inside vim and writing C++, to include a wider range of responsibilities. A significant chunk of my time is now dedicated to reviewing code and design proposals, participating in the hiring process, and helping to educate new developers as they come up to speed on the unique aspects of the system we are all building together.

Practically speaking this means that I’m spending less time in the terminal, and more time communicating via GitHub, chat, e-mail, and Zoom. I still actively participate in developing new features, but I’m finding that I can also provide value by jumping in where needed so that others can stay focused, whether that means I’m investigating a gnarly compiler bug, spelunking in some dark corner of the C++ standard, or working with a customer to understand how we can improve their experience.

The Core team has a diverse set of experiences and proficiencies. We have engineers who are early in their careers as well as developers who have been in the software industry for a long time. Furthermore, our backgrounds span a range of application domains, and are not limited to distributed systems and storage. One challenge that this creates is handling the large set of permutations involving how each of us best learns and communicates, as well as the assumptions and background knowledge that we bring to the table. Being patient and adaptable in how we consume and produce information is a skill that goes a long way on a remote team but, like any skill, it takes persistent effort to develop.

Because of the diversity in our skills and the fact that we’re a remote-first company, we place a high value on communication in the hiring process. We explicitly avoid the gauntlet of whiteboard interviews and impossible puzzles that seem to have become too common in the software industry. Instead, we prioritize good communicators, curious and deep thinkers, and those having familiarity with our core set of languages C++, Python, and Go.

Push the limits of modern hardware: Join our team!

Being part of Redpanda’s Engineering Team means having the opportunity to learn and work with some of the latest technologies available, as well as being part of a really fine group of humans.

Now that you know a bit more about the work we do and what we look for in engineering applicants, consider joining a team that’s constantly pushing the limits of data streaming on modern hardware: apply on our Careers page! Join us on GitHub and Slack, and follow me on Twitter at @dotnwat.

‍

No items found.

Join the Redpanda Community on Slack

Chat with our team, ask industry experts, and meet fellow data streaming enthusiasts.