Site Reliability Engineer

Design, build, and operate a world-class real-time streaming cloud platform

We are building Redpanda, a real-time streaming engine for modern applications. Redpanda is used by Fortune 1000 enterprises pushing hundreds of terabytes a day, and by the solo dev prototyping a React application on her laptop. We go beyond the Kafka protocol into the future of streaming, with inline WASM transforms and geo-replicated hierarchical storage. Think of it as a data API platform that scales with you from the smallest projects to petabytes of data distributed across the globe.

We are on a mission to enable every developer to supercharge their real-time applications.

You Will

You will be part of our cloud team, working with all of engineering to build new services, automate the infrastructure lifecycle on Kubernetes, and monitor our services, with the goal of offering a reliable, scalable, and high-performance SaaS. One of our primary goals is to run a managed, cloud-based streaming-as-a-service offering with 99.5% uptime or better, and this role is critical to that goal.

  • Design and build Redpanda’s cloud infrastructure with reliability and performance in mind.

  • Build tools and services that enable automated infrastructure management and self-healing, including deployments and upgrades.

  • Be in charge of end-to-end monitoring of our cloud. Layer observability into our Kubernetes operators. Prioritize what metrics to collect, drive analysis of those metrics, and influence our roadmap based on that analysis.

  • Participate in on-call rotations, working to keep customer workloads running and incident-free.

You’ll be part of a diverse team with members in both the US (New York City, San Francisco, San Diego, Austin, Denver) and international locations, including Colombia, Denmark, the United Kingdom, Russia, Poland, the Czech Republic, Germany, Greece, and Japan, and the list is growing!

You Have

  • 3+ years of experience in an SRE-like role

  • Comfort working with a 100% distributed engineering team, collaborating on GitHub, in the open

  • Strong experience with public cloud providers

  • Experience running highly scalable production workloads reliably on Kubernetes

  • Experience with monitoring at scale

  • Experience managing infrastructure predictably through GitOps and IaC

  • Solid programming skills

  • Willingness to participate in an on-call rotation

  • Excellent written communication skills

Nice to Have

  • Strong understanding of Go and Kubernetes

  • Experience operating a SaaS platform

  • Fluency in a couple of programming languages (for example, Go and Python)

  • Experience operating or using streaming platforms, as either a user or a provider

  • Experience with the Prometheus monitoring stack