Out-of-memory (OOM) events are common on Linux when programs allocate a lot of memory. Redpanda is one such program: it uses the Seastar library, which tries to use the hardware to its limits.
There is special kernel functionality, called the Out Of Memory Killer (OOM Killer), that helps keep Linux machines operational by killing the biggest process with the lowest priority. OOM Killer can recognize and respect processes that have constraints in Linux
Unless you specify input parameters, Redpanda reads the amount of memory available on the hardware, sets aside at least 1.5 GiB for the operating system (OS), and divides the rest equally among the machine's cores to maximize the efficiency of the Seastar memory allocator. If Redpanda is running alongside other programs, the Linux OS might run out of memory.
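As a rough illustration of that default split, here is a minimal Python sketch. The function name `per_core_memory` and the fixed 1.5 GiB reserve are assumptions made for the example; the text only says "at least 1.5 GiB", so the real reserve may be larger, and this is not Seastar's actual code.

```python
GIB = 1024 ** 3


def per_core_memory(total_bytes: int, cores: int,
                    os_reserve_bytes: int = int(1.5 * GIB)) -> int:
    """Return the bytes each core would get after the OS reserve.

    Hypothetical helper: models "set aside 1.5 GiB, split the rest
    equally", which is a simplification of the described behaviour.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    usable = total_bytes - os_reserve_bytes
    if usable <= 0:
        raise ValueError("not enough memory left after the OS reserve")
    return usable // cores


# Example: a 16-core machine with 64 GiB of memory.
share = per_core_memory(64 * GIB, 16)
print(f"{share / GIB:.2f} GiB per core")
```

The point of the sketch is that nothing is left over for other programs: everything above the OS reserve is handed to the cores.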
If you’ve also experienced problems with OOM Killer, keep reading to learn how we resolved our issues with it so you can do the same.
How OOM Killer began interrupting our sidecar
When we began experiencing problems with OOM Killer, Redpanda Cloud used (and still uses) Kubernetes (K8s) and relied on cgroups and Linux namespaces to constrain workloads. If Redpanda wasn’t told which memory parameters to use, the underlying Seastar library would reserve 1.5 GiB for the OS and divide the rest of the cgroup’s memory among the available CPU cores.
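To make "the rest from the cgroup" concrete, the sketch below reads the memory limit a container is running under. The two file paths are the standard cgroup v2 and v1 locations; the helper names are made up for this example, and this is not how Seastar (a C++ library) discovers the limit.

```python
from pathlib import Path
from typing import Optional

# Standard locations of the memory limit inside a container.
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup limit file; cgroup v2 writes 'max' for no limit."""
    value = raw.strip()
    if value == "max":
        return None
    return int(value)


def container_memory_limit() -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if unavailable.

    Note: cgroup v1 reports "no limit" as a very large number instead
    of 'max'; this sketch returns that number unchanged.
    """
    for path in (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT):
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


limit = container_memory_limit()
print("no cgroup limit found" if limit is None else f"limit: {limit} bytes")
```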
Such a setup didn’t make sense for a containerized environment where Redpanda was isolated from any other process. Users of the Redpanda operator shouldn’t have to worry about how to set up the advanced Redpanda memory parameters but, depending on the desired capacity, adequate K8s nodes must be available for Redpanda, and correct limits and requests need to be set. The first sizing for the Redpanda pod in K8s reserved 0.5 GiB of memory for the other pods running on a dedicated Redpanda node.
To automate and ease the K8s deployment of Redpanda, we created an operator. To constrain Redpanda and leverage cgroup capabilities, we provided a resource configuration option in the cluster custom resource. This configuration was mapped directly to the Redpanda configuration so that Redpanda could use all memory available to the container.
In our first Redpanda operator implementation, the K8s deployment resource was configured to not overwrite the container entry point. The default entry point leveraged supervisord to schedule Redpanda processes, telemetry reporting, and WebAssembly (Wasm) coprocessors. That simplification played a role in local environment deployments (e.g.
When Redpanda warmed up its cache, OOM Killer saw that memory inside the Redpanda cgroup was exhausted, and it killed the biggest Redpanda process. Users would see that the broker was unavailable until the container runtime restarted the Redpanda process. The Redpanda operator could take over the job of supervisord by scheduling only one process per container, leaving the container runtime to do the heavy lifting and isolate each process. Debugging further problems became easier because OOM Killer recognized individual processes and only those were affected.
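The selection of "the biggest process" can be sketched with a simplified model of the kernel's badness heuristic. This is an approximation for illustration only: the real kernel also counts swap and page-table memory, and the sketch assumes that "priority" corresponds to a process's `oom_score_adj` value. The `Proc` class and the process sizes are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Proc:
    name: str
    rss_pages: int           # resident memory, in pages
    oom_score_adj: int = 0   # -1000 (never kill) .. 1000 (kill first)


def badness(proc: Proc, total_pages: int) -> Optional[int]:
    """Simplified score: memory use shifted by oom_score_adj."""
    if proc.oom_score_adj == -1000:
        return None  # exempt from OOM kills
    return proc.rss_pages + proc.oom_score_adj * total_pages // 1000


def pick_victim(procs: Iterable[Proc], total_pages: int) -> Optional[Proc]:
    """Return the process with the highest score, if any is eligible."""
    scored = [(badness(p, total_pages), p) for p in procs]
    eligible = [(score, p) for score, p in scored if score is not None]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[0])[1]


# A 2 GiB cgroup (524288 pages of 4 KiB) with made-up process sizes.
procs = [Proc("redpanda", 500_000), Proc("rpk", 20_000), Proc("pause", 1)]
victim = pick_victim(procs, total_pages=524_288)
print(victim.name if victim else "nothing to kill")
```

In this model Redpanda is selected simply because it holds almost all of the cgroup's memory, which matches what the broker restarts looked like from the outside.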
The first solution we tried for the OOM Killer events was a K8s deployment in which every process ran in its own dedicated container. While investigating this solution, we saw that rpk debug info, which sends telemetry data, was executed every 10 minutes. The problem was that when Redpanda had a higher-than-usual load, our sidecar used more memory than was set in its cgroup, and OOM Killer started to kill the sidecar container.
Next, the Cloud team optimized the managed solution by eliminating all sidecars from the deployment. The telemetry was moved outside the Redpanda pod and Wasm coprocessors were disabled until GA. With only one Redpanda process running in the pod, the memory cgroup constraints were mapped directly to Redpanda memory. In long-running clusters, memory allocation grew to the point where, from the OS perspective, all available memory was consumed by Redpanda. The processes were again killed by OOM Killer. At this point, we were looking for a bug in Redpanda, but it turned out that the K8s pod implementation is backed by a pause container.
Solving the OOM Killer challenge
To create a container sandbox and be able to restart individual containers in a multi-container pod setup, K8s relies on the pause process to orchestrate the other processes. Looking at the source code, this process might not seem that big in terms of memory, but it needs one page from the operating system just to work. This one page plays a key role when OOM Killer scans all cgroups and finds that the Redpanda container exceeds its memory limit.
Once the OOM Killer report proved that the pause container was listed alongside the Redpanda process, we implemented memory reservation to solve this issue. Even with a single container, we couldn’t allocate all of the memory to the Redpanda process.
The Redpanda operator extends the cluster custom resource definition to include a Redpanda resource configuration. Now, the cgroup memory limit is no longer tied to Redpanda’s maximum memory allocation. Depending on the K8s worker node size and the traffic on a particular node, clients can assign less memory to Redpanda than to the container.
The next improvement we made to resolve our issues with OOM Killer was to add a default 10% memory reservation for the OS. This was done to prevent memory pressure in overprovisioned K8s worker nodes. If Redpanda operator users did not set Redpanda memory, then, in big enough clusters where the whole memory limit was distributed among the pods, clients could observe memory pressure events. With spikes in traffic and Kafka clients' usage, the SRE team might observe that the default kubelet host memory reservation is not enough for the operating system. This 10% memory reservation was implemented to help clients that were already using the Redpanda operator: an operator upgrade would recalculate the necessary memory reservation. This solution gives room for the pause container and other kernel data structures that are necessary for the K8s node to work correctly.
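The arithmetic behind the reservation can be sketched as follows. The function name, the integer-percent parameter, and the 4 KiB page size are assumptions for the example; the operator's actual implementation may round differently.

```python
GIB = 1024 ** 3
PAGE_SIZE = 4096  # assumed page size; the pause process needs one page


def split_container_memory(limit_bytes: int, reserve_percent: int = 10):
    """Split a container memory limit into (redpanda, reserved) bytes.

    Hypothetical helper: reserves reserve_percent of the limit and
    gives the remainder to Redpanda.
    """
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    reserved = limit_bytes * reserve_percent // 100
    return limit_bytes - reserved, reserved


redpanda_bytes, reserved_bytes = split_container_memory(8 * GIB)
print(f"Redpanda: {redpanda_bytes / GIB:.2f} GiB")
print(f"Reserved: {reserved_bytes / GIB:.2f} GiB")
# The cushion comfortably covers the single page used by pause.
print("room for pause:", reserved_bytes >= PAGE_SIZE)
```

With a zero reservation the second value is 0 bytes, which is the situation described earlier: one extra page from the pause process is enough to push the cgroup over its limit.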
Optimizing resource consumption in bigger machines
In the bigger clusters (e.g. 16 cores and 64 GiB), Redpanda needs to give more room to the auxiliary services. Each core is occupied by a Redpanda shard. A single shard doesn’t overload the metrics system or logging aggregator (for example, Prometheus for metrics or FluentBit for logging) but, multiplied by the number of cores, the load can significantly change their resource requirements. When OOM Killer looked for the biggest process with the lowest priority inside each cgroup, Redpanda was picked to be terminated. K8s node-exporter started to report node memory pressure events. For our biggest deployments we adjusted memory to leave more room for
Ironically, to prevent OOM kills of Redpanda, we actually reduced the amount of memory Redpanda used: first by reducing the amount of memory allocated to the cgroup, and then by reducing the amount of memory Seastar can use within that
How to adjust default memory allocation
Despite encountering these challenges with the OOM Killer, we were able to effectively troubleshoot these memory usage issues. We are now more mindful of resource constraints in a containerized environment. We improved our observability stack and the Redpanda operator to make it easier to debug the loss of Redpanda nodes.
For any user of the Redpanda operator, the most important thing to understand is that, by default, the operator reserves 10% of the provided K8s memory request as a cushion and gives the rest to Redpanda.
If users want to change the 10% threshold in the cluster custom resource section, they must calculate requests, limits, and Redpanda options to match the desired configuration:
```yaml
kind: Cluster
spec:
  resources:
    requests:
      cpu: 2
      memory: 2.23Gi
    redpanda:
      cpu: 2
      memory: 2Gi
    limits:
      cpu: 2
      memory: 2.23Gi
```
If they do not need to change the 10% memory cushion, the Redpanda section can be omitted.
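When sizing in the other direction, from the memory Redpanda should get to the container request that provides it, the calculation can be sketched like this. The helper name is hypothetical, and rounding up to two decimals is an assumption about how a value such as 2.23Gi is reached for 2Gi of Redpanda memory.

```python
import math


def container_request_gib(redpanda_gib: float, reserve_percent: int = 10) -> float:
    """Container memory (GiB) needed so Redpanda gets redpanda_gib.

    Rounds up to two decimals, which errs on the side of slightly
    more headroom rather than less.
    """
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    raw = redpanda_gib * 100 / (100 - reserve_percent)
    return math.ceil(raw * 100) / 100


request = container_request_gib(2)
print(f"requests.memory and limits.memory: {request}Gi")
```

Keeping requests and limits equal, as in the manifest above, means the cushion is the only slack inside the container, so a different reservation requires recalculating both values.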
Not only should the cgroup be taken into account, but so should the overall memory resource exhaustion on the K8s node.
By optimizing the overhead of the containerized environment, we’re able to provide a better-managed cloud experience and meet our users wherever they are in their streaming applications journey.