Published: 2026-05-07

AgentSpan Makes LangChain and CrewAI Pipelines Crash-Proof with Per-Step Persistence

AgentSpan is an open-source (MIT), self-hosted runtime that solves a fundamental problem with production AI agent pipelines: when a crash or restart happens, the pipeline re-runs from scratch and repeats every side effect — duplicate emails, double database writes, extra API calls. AgentSpan fixes this by moving orchestration state to a separate server and persisting every individual tool call, not just nodes. Built by the team behind Netflix Conductor.

Source video

"Agentspan: Build Crash-Proof AI Agents Pipelines (Free) with LangChain/LangGraph/CrewAI" by FuturMindsWatch on YouTube →

Key Takeaways

  • Every individual tool call and LLM call becomes its own persisted task — more granular than LangGraph checkpointers, which only save at the node level.
  • Orchestration state moves to a separate AgentSpan server; your Python code becomes a worker that registers functions and receives dispatched tasks.
  • A crashed pipeline resumes from the exact failed step — completed tool calls are not re-executed, preventing duplicate emails, charges, or API calls.
  • Human-in-the-loop approvals are stored as durable tasks — a thread doesn't need to stay open for hours or days, and approval state survives process restarts.
  • MIT-licensed, fully self-hosted — your agent data never leaves your own infrastructure.
  • Works alongside LangGraph, CrewAI, and OpenAI Agents SDK — no framework migration required, just a few lines of change.

The Problem: Why Agent Pipelines Fail in Production

Every major agent framework today — LangGraph, CrewAI, OpenAI Agents SDK — runs the entire pipeline inside your Python process. Every LLM call, every tool call, every intermediate result lives in process memory. This creates four compounding problems: a single crash or container restart forces a full re-run from scratch; there's no visibility into which agents completed or what they produced; restarting causes duplicate side effects from already-executed tools; and human-approval workflows require blocking an open thread for hours or days.
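The failure mode is easy to reproduce without any framework at all. The sketch below is purely illustrative (no real agent framework involved): every intermediate result lives in local variables, so a crash mid-run loses them all, and a restart fires the side-effecting step a second time.

```python
# Framework-free sketch of the in-process failure mode: all state is
# held in process memory, so a restart re-runs everything, including
# tools with side effects.

emails_sent = []  # stands in for a real side effect (e.g. SMTP send)

def research():
    return "findings"                     # held only in process memory

def write(findings):
    emails_sent.append("notify-team")     # side effect fires on EVERY run
    return f"draft based on {findings}"

def run_pipeline(crash_in_writer=False):
    findings = research()
    draft = write(findings)
    if crash_in_writer:
        raise RuntimeError("container restarted")  # simulated crash
    return draft

try:
    run_pipeline(crash_in_writer=True)    # first attempt: email sent, then crash
except RuntimeError:
    pass
run_pipeline()                            # restart: re-runs from scratch
assert emails_sent == ["notify-team", "notify-team"]  # duplicate email
```

The final assertion is the whole problem in one line: the restart had no record that the notification already went out.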

LangGraph has checkpointers (an in-memory saver, Postgres) but they work at the node level — they save state between agents, not within them. If a researcher agent calls three tools and crashes after the second, the checkpointer doesn't help: the entire node reruns. FuturMinds demonstrates this exact scenario: a researcher-writer-editor pipeline that crashes during the writer stage ends up sending duplicate notification emails to the team when restarted.
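The granularity gap can be sketched in a few lines. This is an illustrative simulation, not real LangGraph code: a node-level store records state only when the whole node finishes, so a crash between tool calls replays every tool in that node on restart.

```python
# Simulation of node-level checkpointing: state is saved only at the
# node boundary, so a crash mid-node replays all of its tool calls.

checkpoint = {}   # node-level store: one entry per *completed* node
tool_calls = []   # records every tool execution (i.e. side effects)

def tool(name):
    tool_calls.append(name)
    return f"{name}-result"

def researcher_node(crash_after=None):
    if "researcher" in checkpoint:           # node finished earlier: skip it
        return checkpoint["researcher"]
    results = []
    for i, name in enumerate(["search", "summarize", "notify"], start=1):
        results.append(tool(name))
        if crash_after == i:
            raise RuntimeError("crash mid-node")
    checkpoint["researcher"] = results       # saved only when the node ends
    return results

try:
    researcher_node(crash_after=2)   # dies after the 2nd of 3 tool calls
except RuntimeError:
    pass
researcher_node()                    # restart: checkpoint is still empty
# "search" and "summarize" each ran twice before "notify" fired.
```

With per-step persistence (what AgentSpan does), the first two calls would have been recorded individually and skipped on the re-run.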

How AgentSpan Fixes It

AgentSpan separates your code from the execution state. Your code still defines the agents and tools — nothing changes there. But orchestration state (which agent runs next, what each previous agent produced, where in the pipeline you are) moves to a separate AgentSpan server. Your Python process becomes a worker: it registers tool functions with the server, receives dispatched calls, and returns results. The server persists every completed step.
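The worker model described above can be sketched as a minimal in-memory simulation. All names here are hypothetical — the real AgentSpan SDK and its API are not shown in the video summary. The point is only the division of labor: the "server" owns the workflow state and dispatches one task at a time, while the Python process registers functions and executes whatever it is handed.

```python
# In-memory sketch of the server/worker split. In reality the server
# would be a separate network service; here it is a class so the idea
# is runnable in one file.

class Server:
    """Stands in for the separate AgentSpan server: owns all state."""
    def __init__(self, steps):
        self.steps = steps            # ordered task names for the workflow
        self.results = {}             # persisted output of each completed task

    def next_task(self):
        for name in self.steps:
            if name not in self.results:
                return name           # first step without a stored result
        return None                   # workflow complete

    def complete(self, name, result):
        self.results[name] = result   # server persists every finished step

class Worker:
    """Stands in for your Python process: registers and runs functions."""
    def __init__(self, server):
        self.server = server
        self.handlers = {}

    def register(self, name, fn):
        self.handlers[name] = fn      # the code itself stays in your process

    def poll_once(self):
        task = self.server.next_task()
        if task is None:
            return False
        self.server.complete(task, self.handlers[task]())
        return True

server = Server(["research", "write", "edit"])
worker = Worker(server)
worker.register("research", lambda: "findings")
worker.register("write", lambda: "draft")
worker.register("edit", lambda: "final copy")
while worker.poll_once():
    pass
```

Because `server.results` lives outside the worker, a replacement worker attached to the same server would see completed steps and only be dispatched the remaining ones.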

When your Python process dies, the server still has the full workflow state. It knows which steps completed and which were in progress. When your process restarts, it picks up exactly where it left off. Already-completed tool calls are skipped entirely — no duplicate notifications, no double-charged transactions. Tool idempotency is built in at the platform level, not left to individual developers to implement per tool.
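The resume behavior can be demonstrated with a persisted step log standing in for the server's state (names are hypothetical, and a JSON file plays the role of the AgentSpan server here): a restarted process skips every call that already has a recorded result, so the side-effecting step runs exactly once.

```python
# Sketch of resume-from-failed-step: each tool call's result is
# persisted immediately, so a re-run after a crash skips completed
# calls instead of repeating their side effects.

import json
import os
import tempfile

LOG = os.path.join(tempfile.gettempdir(), "agentspan_demo_steps.json")
side_effects = []

def load():
    return json.load(open(LOG)) if os.path.exists(LOG) else {}

def run_step(name, fn):
    log = load()
    if name in log:
        return log[name]              # completed earlier: never re-run
    result = fn()
    log[name] = result
    with open(LOG, "w") as f:
        json.dump(log, f)             # persist after every individual call
    return result

def send_email():
    side_effects.append("email")
    return "sent"

def pipeline(fail_in_writer=False):
    run_step("research", lambda: "findings")
    run_step("notify", send_email)    # side-effecting tool call
    if fail_in_writer:
        raise RuntimeError("writer crashed")
    return run_step("write", lambda: "draft")

if os.path.exists(LOG):
    os.remove(LOG)                    # start from a clean workflow
try:
    pipeline(fail_in_writer=True)     # first run: email sent, then crash
except RuntimeError:
    pass
pipeline()                            # "restart": research and notify skipped
assert side_effects == ["email"]      # the email went out exactly once
```

The contrast with the earlier in-process sketch is the single line in `run_step` that checks the log before executing: that check is what AgentSpan provides at the platform level instead of each developer re-implementing it per tool.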
