
Autotune series: Part 1 - storage

Why it matters

The short version, of course, is money. 50% of the total cost of ownership for a storage system is spent on personnel, who in turn spend 25% of their time tuning these systems[1]. Some of us have even built careers on performance engineering.

Autotuning, or increasing the vertical scalability via performance engineering, is focused on the time, energy and ultimately, the cost to run a system.

Redpanda is a queue that can go at the speed of hardware - measured within 10% of the physical limits. Before we get into the weeds, the good news is that you - our beloved user - should not have to give one f@&k darn about any of it, because it is all automated, so you can focus on the important things in life... like planning your interoffice ultimate frisbee tournament.

Multiple levels of autotuning

Let’s be clear: no amount of fiddling with a system will save you from bad architecture. Poor locking, contention, bad points of serialization, poor use of memory cache lines and excessive message passing will bring a system to a crawl. Even if you run emacs on 144 cores, it will still be slow, because its architecture is single-threaded.

Architecture

Redpanda uses a shared-nothing[2] and structured message passing architecture[3]. Each core owns a set of partitions (the data model is a namespace-topic-partition tuple). Each partition is a Raft group replicated across a set of machines. Locally, there is an exclusive reader and writer per log segment (a file on disk containing the actual data).
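
To make the ownership model concrete, here is a minimal C++ sketch - ours for illustration, not Redpanda’s actual code - of mapping a namespace-topic-partition tuple to a single owning core, so all I/O for a partition stays on one shard without cross-core locking:

    #include <cstdint>
    #include <functional>
    #include <string>

    // Illustrative only: a partition identity and a deterministic core assignment.
    struct ntp {
        std::string ns;
        std::string topic;
        int32_t partition;
    };

    // Combine the three fields into one hash and pick a shard. Because the
    // mapping is deterministic, every request for a given partition lands on
    // the same core, which is what makes exclusive, single-writer log segments
    // possible without locks.
    inline unsigned owner_core(const ntp& id, unsigned n_cores) {
        size_t h = std::hash<std::string>{}(id.ns);
        h ^= std::hash<std::string>{}(id.topic) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int32_t>{}(id.partition) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return static_cast<unsigned>(h % n_cores);
    }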

We have no locks on the hotpath. 100% of the machine memory (minus a small amount reserved for the OS) is allocated up front to amortize the cost of management, allocation and deallocation. All subsystems have a min-max reservation quota used to apply backpressure. Only the min is guaranteed to be available at all times.
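
As a rough illustration of the min-max reservation idea (the names and shape below are ours, not Redpanda’s internals), a per-subsystem quota that answers "may I take this memory now?" is enough to propagate backpressure:

    #include <algorithm>
    #include <cstddef>

    // Illustrative sketch: each subsystem gets a guaranteed minimum and a
    // ceiling. Anything above the minimum is granted opportunistically; when
    // try_reserve() says no, the caller backs off (backpressure) instead of
    // reaching for more memory.
    class memory_quota {
    public:
        memory_quota(size_t min_bytes, size_t max_bytes)
          : _min(min_bytes), _max(max_bytes), _used(0) {}

        // True: go ahead. False: apply backpressure and retry later.
        bool try_reserve(size_t bytes) {
            if (_used + bytes > _max) {
                return false;
            }
            _used += bytes;
            return true;
        }

        void release(size_t bytes) { _used -= std::min(bytes, _used); }

        // Only this much is guaranteed to be available at all times.
        size_t guaranteed_min() const { return _min; }

    private:
        size_t _min;
        size_t _max;
        size_t _used;
    };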

Our log segment writers use Direct Memory Access (DMA), which means we bypass the page cache for reads and writes. We manually align memory according to the layout of the filesystem, and we have spent a great deal of effort ensuring that we flush as little data as possible to the actual physical device.
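
At the syscall level, a DMA-style write boils down to an O_DIRECT file descriptor and an aligned buffer. A minimal sketch, assuming a 4096-byte filesystem block size (the path and sizes are illustrative, not Redpanda’s segment writer):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>

    // With O_DIRECT the page cache is bypassed, so buffer address, file offset
    // and length must all be multiples of the filesystem/device block size.
    int write_direct(const char* path) {
        constexpr size_t alignment = 4096;          // assumed fs block size
        constexpr size_t buf_size = 64 * 1024;      // one aligned 64KB chunk

        int fd = ::open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            return -1;
        }
        void* buf = nullptr;
        if (::posix_memalign(&buf, alignment, buf_size) != 0) {
            ::close(fd);
            return -1;
        }
        std::memset(buf, 0, buf_size);              // real payload would go here

        ssize_t n = ::pwrite(fd, buf, buf_size, 0); // offset 0 is already aligned
        std::free(buf);
        ::close(fd);
        return n == static_cast<ssize_t>(buf_size) ? 0 : -1;
    }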

Physically, we dispatch multiple large buffer writes to disk concurrently, relying heavily on the sparse file support of XFS. While measuring, we determined that concurrent writes into a sparse file[4] followed by an fdatasync() are about 10X faster than running a filehandle opened with O_DIRECT and O_DSYNC, which forces the OS to dispatch a disk controller flush() per page-aligned write.
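
The winning pattern from that measurement looks roughly like this - several concurrent, aligned writes at distinct offsets of the same file, then one fdatasync() for the whole batch. The chunk size, concurrency and use of plain threads below are illustrative simplifications, not how Redpanda actually schedules its I/O:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>
    #include <thread>
    #include <vector>

    constexpr size_t k_alignment = 4096;   // assumed fs block size
    constexpr size_t k_chunk = 1 << 20;    // 1MB per write, illustrative
    constexpr int k_writers = 4;           // illustrative concurrency

    // Dispatch several aligned writes at spread-out offsets of the same file,
    // then pay for a single fdatasync(), instead of forcing a device flush per
    // write with O_DSYNC.
    int write_sparse_then_sync(const char* path) {
        int fd = ::open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            return -1;
        }
        std::vector<std::thread> writers;
        for (int i = 0; i < k_writers; ++i) {
            writers.emplace_back([fd, i] {
                void* buf = nullptr;
                if (::posix_memalign(&buf, k_alignment, k_chunk) != 0) {
                    return;
                }
                std::memset(buf, 0, k_chunk);
                // Offsets are spread out; the untouched ranges in between stay
                // sparse on XFS, so writers need not proceed front-to-back.
                ::pwrite(fd, buf, k_chunk, static_cast<off_t>(i) * k_chunk * 8);
                std::free(buf);
            });
        }
        for (auto& t : writers) {
            t.join();
        }
        int rc = ::fdatasync(fd);          // one flush for the whole batch
        ::close(fd);
        return rc;
    }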

Visually, it looks like this:

[image]

Runtime

The runtime can effectively be described, from an outsider’s perspective, as a never-ending game of whack-a-mole. This is where 25% of the cost of your DBAs comes from[1]. Do not despair, your prayers have been answered. This is what we are eliminating today with our autotuner tool - RPK: Redpanda Keeper... yas!

Given that Redpanda has virtually no tunables, when we say ‘Runtime’ we mean the Linux kernel. The thing you love, fight, pray to, beg, implore, but ultimately submit to. The truth about your computer is that it does not run on some idealized vision of hardware; it probably runs some rebranded NVMe SSD mounted at /dev/nvme0n*, put together in Taiwan, with an Intel Xeon chip prototyped in Arizona and mass-produced in China. The runtime has to deal with the real world: bad drives, damaged memory DIMMs and, of course, your software.

Add that to machine-specific and use-case-dependent kernel settings like interrupt affinity, interrupt coalescing, I/O schedulers, the CPU governor, etc., and the previously negligible cost of the I/O stack becomes a major contributor to latency stalls on an untuned system. The performance you get out of any given machine is unique at the micro scale.

Interrupt management

Modern I/O devices (i.e. NICs, NVMe SSDs) generate hundreds of thousands of high-priority interrupts per second on a busy system, ballooning your latency if a single core is left to handle them all. As the core’s ISR (Interrupt Service Routine) services each interrupt, it causes a context switch between kernel and userspace; at 5μs per context switch and 100K interrupts/sec, that is 500ms of artificial latency every second. Ouch.

This ignores the fact that if a single core is handling these interrupts, it is probably also having to move all the data back and forth between the other worker CPUs, but… moving on… moving on...

Since Linux 2.4, we have had the ability to manually assign certain IRQ (Interrupt Request) types to specific CPUs. This mechanism is called Symmetric Multiprocessing (SMP) affinity. Distributing interrupt handling among the designated CPUs and binding IRQs to specific cores improves I/O throughput and can reduce latency by allowing the core that issued the request to handle the interrupt itself. Now that 500ms on a 64-core system looks a lot more like 7.8ms. Nice!
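
For the curious, the mechanism is just procfs. The sketch below is illustrative: the IRQ number and CPU you would pass in are examples (on a real box you would look them up in /proc/interrupts, and irqbalance may fight you for the setting). It writes a CPU list to /proc/irq/<irq>/smp_affinity_list and spells out the arithmetic from above:

    #include <fstream>
    #include <string>

    // Back-of-the-envelope: 100K interrupts/sec * 5us per context switch
    // = 500ms of overhead per second if one core takes them all; spread across
    // 64 cores, that is roughly 500 / 64 ~ 7.8ms.
    //
    // Pin an IRQ to a single CPU by writing a CPU list to procfs (needs root).
    bool pin_irq_to_cpu(int irq, unsigned cpu) {
        std::ofstream f("/proc/irq/" + std::to_string(irq) + "/smp_affinity_list");
        if (!f) {
            return false;
        }
        f << cpu << "\n";   // e.g. "7" binds this IRQ's handler to core 7
        return static_cast<bool>(f);
    }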

[image]

In practice, the only thing that matters when it comes to issuing iocbs (I/O control blocks) is which cores are handling and managing the request lifecycle. With recent SSDs from Intel or Samsung, you can buy an off-the-shelf 500K 4KB-IOPS drive. That means the drive - at best - can give you (assuming 64 cores) about 7.8K IOPS per core, which is around 31MB per core per second.

With these measurements you can start to reason about the latency and throughput of your system. You know that 31MB of I/O will finish within a 1-second tail on every core. This is a bottom-up approach one can use when building SLAs into a product.
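
Spelled out, the arithmetic looks like this (the drive spec and core count are the assumptions from above, not measurements of your hardware):

    #include <cstdio>

    int main() {
        constexpr double drive_iops = 500000.0;  // off-the-shelf 4KB-IOPS spec
        constexpr double io_size_kb = 4.0;       // 4KB per I/O
        constexpr int cores = 64;

        constexpr double iops_per_core = drive_iops / cores;                 // ~7.8K
        constexpr double mb_per_core_s = iops_per_core * io_size_kb / 1000;  // ~31MB/s

        std::printf("per core: %.1f IOPS, %.1f MB/s\n", iops_per_core, mb_per_core_s);
        return 0;
    }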

Sadly, it is not as easy as multiplying the number of cores and devices and then saying - Oh! We can handle 1TB/sec. This is because performance is not composable. The modalities of performance curves are such that when you saturate a system it follows Little’s law. Things will scale somewhat linearly for a while, and then it all goes downhill when you hit a limit. In some cases it could be your PCIe bus not having enough throughput to feed all the drives, in many cases it will be the speed of the drive itself, and in some it will be the CPU time spent decompressing your RPCs. However, not all hope is lost. There are things we can do, like using a noop scheduler.
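
As a quick worked example of what Little’s law says here (the latency figure is an illustrative assumption): the number of requests you need in flight to sustain a given throughput is simply throughput times latency, and once the device saturates, pushing more concurrency only grows the queue - and therefore the latency - not the throughput.

    #include <cstdio>

    // Little's law: L (requests in flight) = lambda (throughput) * W (latency).
    int main() {
        constexpr double lambda_iops = 500000.0;  // target throughput in IOPS
        constexpr double w_seconds = 200e-6;      // assumed mean latency: 200us

        constexpr double in_flight = lambda_iops * w_seconds;  // = 100 requests
        std::printf("~%.0f requests in flight to sustain %.0f IOPS at %.0fus each\n",
                    in_flight, lambda_iops, w_seconds * 1e6);
        return 0;
    }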

[noop] - Our I/O scheduler marriage

Not all I/O requests are made equal. Background compaction, in-flight read & write requests that interact with users, cleanup procedures, log expiration, etc., are all still competing for a physically bounded channel that is sending bytes to disk. There is a limit per device, and we want to be in control of what, where and how I/Os get dispatched, reordered, merged and canceled. The Linux kernel offers a noop scheduler that processes requests in FIFO order, allowing the application to do the smarts.
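
For reference, the scheduler is exposed per device through sysfs. A rough sketch of checking and setting it (the device name is an example; on modern multi-queue blk-mq kernels the noop equivalent is called "none", and rpk handles this for you):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Example device; list yours under /sys/block/.
        const std::string path = "/sys/block/nvme0n1/queue/scheduler";

        std::ifstream in(path);
        std::string current;
        std::getline(in, current);
        std::cout << "current: " << current << "\n";  // e.g. "[none] mq-deadline kyber"

        std::ofstream out(path);                      // needs root
        out << "none\n";                              // "noop" on older single-queue kernels
        return out ? 0 : 1;
    }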

Introducing Redpanda Keeper (rpk)

Understandability is the only way to run large-scale systems. However, understanding your system, your vendor’s system and the interaction between the two, alongside the massive number of tuning options in the Linux kernel, can be daunting.

At Redpanda we decided to make system performance tuning as easy as rebooting your computer. Our rpk tool will take any ol’ machine and tune the network (per NIC), disk (per device), CPU p-states, frequencies, and so much more. If you can tune the hardware better than our tool can, we have a bug - that’s our promise. Stay tuned for part 2, where we cover how we reason about networking.




[1] Automatic Database Management System Tuning Through Large-scale Machine Learning - https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf
[2] Seastar - shared-nothing architecture framework - https://seastar.io
[3] Seastar Futures - https://www.alexgallego.org/concurrency/smf/2017/12/16/future.html
[4] Sparse File Tests on XFS - https://www.alexgallego.org/concurrency/o_direct/2018/02/02/O_DIRECT.html
