Hello, Agent: How do you make AI agents fail-safe in a world of unreliable systems?

Durable execution, reliability engineering, and the future of agentic AI with Jeremy Edberg at DBOS

Hosted by

Alexander Gallego

Guest

Jeremy Edberg

May 28, 2026

Topics

Durable Execution

AI Agents

Reliability Engineering

Distributed Systems

Production AI

Enterprise AI

Show Notes:

Durable execution has been solving reliability problems for decades. Now it's the missing primitive for agentic AI systems.

We're joined by Jeremy Edberg, C-Suite Advisor at DBOS, Inc., to explore how durable execution works, why it matters for building production-ready AI agents, and what lessons from Netflix-scale reliability engineering apply directly to the age of agentic coding.

Key takeaways

(00:00) Introduction & Show Overview: Alex introduces Hello, Agent! and welcomes Jeremy to discuss durable execution and reliability engineering for agentic systems.

(01:10) Exploring the DBOS Concept: Jeremy discusses how DBOS fits into the Postgres renaissance, in which Postgres has become the foundation for databases, vector stores, and full-text search engines across the industry.

(02:47) Jeremy's Lifetime Obsession with Reliability: Jeremy traces his journey from childhood anxiety about systems breaking, through his early career helping people get on the internet, to becoming an engineer driven by one core principle: if you do it twice, automate it.

(04:50) From Reddit to Building at Scale: Jeremy reflects on his time as Reddit's first employee and his later work at Netflix, learning that reliability is as much a cultural imperative as a technical one in enterprise environments.

(09:25) Netflix's On-Call Culture & Automation Philosophy: Netflix shifted from three engineers watching screens in shifts to one engineer on-call per week by building deep trust in their alerting and automation systems.

(10:41) The Durability Concept: Jeremy introduces durable execution as the practice of saving execution state, similar to a video game save point, allowing workflows to resume from their last successful step.

(11:40) Why Durable Execution Matters for AI Agents: Since agents make non-deterministic calls to LLMs, durable execution's ability to replay past work is powerful: you can debug agentic workflows without triggering new API calls.

(12:31) LLM Temperature & Non-Determinism: The conversation explores how LLM temperature settings create randomness in outputs, meaning the same input can produce different results each time, making reproducible debugging critical.

(14:24) The New Bottleneck in Agentic Coding: With AI agents writing code faster than humans, the bottleneck shifts from code generation to deployment, testing, and debugging, and durable execution enables faster iteration on all three.

(15:00) Durable Execution Enhances Enterprise AI Testing: Rather than replacing evaluation frameworks, durable execution provides the data needed for proper evaluation and enables autonomous testing where AI agents discover edge cases.

(21:08) Human-in-the-Loop Workflows Simplified: Long-running agentic workflows can safely pause for human approval and resume cleanly without manual state management, eliminating the need for complex queue infrastructure.

(22:00) DBOS as a Library, Not Infrastructure: DBOS simplifies deployment by providing durable execution as a library in Python, Java, Go, and TypeScript, removing the need for separate infrastructure services.

(24:00) AI Agents Will Eventually Write Infrastructure Code: While Terraform ensures determinism today, Jeremy predicts agents will eventually write infrastructure configuration rather than humans, with Terraform potentially becoming a DBOS workflow itself.

(25:18) Real-World Use Cases Beyond AI Agents: Use cases range from documentation automation (Dosu) to data pipeline reliability, such as syncing SAP and Shopify systems to guarantee data consistency in both directions.

(29:14) Most Software is Actually Workflow-Based: Jeremy clarifies that despite perception, most enterprise applications are business workflows rather than data-intensive systems, making durable execution broadly applicable.

(30:59) Future: Tightening the Agentic Development Loop: Jeremy's vision is AI agents that code, deploy, run, detect errors, fix them, and continue, progressively tightening the feedback loop until human intervention becomes rare.

(31:43) From Four Nines to Five Nines: Self-Healing Systems: Netflix reached four nines through chaos engineering. Five nines requires systems that fix themselves faster than humans can, which is exactly what durable execution enables for enterprise AI.

(33:00) AI Governance & Business Constraints: Durable execution provides the observability needed to enforce guardrails that keep agents within business rules, not just technical ones, addressing enterprise governance concerns.

(34:10) Learning Through Observability: Unlike humans, AI agents can't improve without complete data. Durable execution records everything, enabling agents to learn from successes and failures to improve over time.
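
To make the save-point idea from (10:41) and the replay idea from (11:40) concrete, here is a minimal sketch in plain Python. It is not the DBOS API: `StepLog`, `run_step`, `fake_llm`, and `flaky_publish` are hypothetical names invented for illustration, and an in-memory SQLite table stands in for the Postgres-backed storage discussed in the episode. Each step's output is recorded once; a rerun of the same workflow returns recorded outputs instead of calling the step again, so the simulated LLM call is never repeated.

```python
import json
import random
import sqlite3


class StepLog:
    """Hypothetical checkpoint store: one row per completed workflow step."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step_no INTEGER, output TEXT, "
            "PRIMARY KEY (workflow_id, step_no))"
        )

    def run_step(self, workflow_id, step_no, fn, *args):
        """Return the recorded output if this step already ran; otherwise run and record it."""
        row = self.conn.execute(
            "SELECT output FROM steps WHERE workflow_id = ? AND step_no = ?",
            (workflow_id, step_no),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # replay: fn is not called again
        result = fn(*args)
        self.conn.execute(
            "INSERT INTO steps VALUES (?, ?, ?)",
            (workflow_id, step_no, json.dumps(result)),
        )
        self.conn.commit()  # the "save point"
        return result


calls = {"llm": 0, "publish": 0}  # counts real invocations of each step


def fake_llm(prompt):
    """Stand-in for a non-deterministic LLM call (no network access)."""
    calls["llm"] += 1
    return f"{prompt} -> answer #{random.randint(0, 10**6)}"


def flaky_publish(text):
    """Fails on its first invocation to simulate a crash mid-workflow."""
    calls["publish"] += 1
    if calls["publish"] == 1:
        raise RuntimeError("simulated crash")
    return text.upper()


def workflow(log, workflow_id, prompt):
    draft = log.run_step(workflow_id, 0, fake_llm, prompt)
    return log.run_step(workflow_id, 1, flaky_publish, draft)


conn = sqlite3.connect(":memory:")  # a file or server database would survive restarts
log = StepLog(conn)

try:
    workflow(log, "wf-1", "plan the migration")
except RuntimeError:
    pass  # the run "crashed" after step 0 was checkpointed

result = workflow(log, "wf-1", "plan the migration")  # resumes at step 1
rerun = workflow(log, "wf-1", "plan the migration")   # pure replay of both steps
print(result == rerun, calls)
```

The design choice worth noting is that the checkpoint is keyed by workflow ID and step number, so a rerun with the same ID is what triggers replay; a real library would also handle concurrency, retries, and serialization of richer values, none of which this sketch attempts.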

Resources mentioned

DBOS, Inc.: https://www.dbos.dev

Transact library: https://www.dbos.dev/dbos-transact

CockroachDB: https://www.cockroachlabs.com

Dosu: https://dosu.dev/

Episode FAQs

Why is durable execution important for enterprise AI agents?

How does durable execution enhance debugging in AI systems?

What role does state management play in durable execution?
