SpatialClaw: NVIDIA's training-free code-writing spatial agent
NVIDIA and KAIST researchers released SpatialClaw, a training-free spatial-reasoning agent that hands a vision-language model a persistent Python kernel and lets it write, run, and inspect code one cell at a time — rather than committing to a single up-front program or a rigid, one-tool-at-a-time JSON interface. Fahd Mirza walks through the five-stage agent loop, the self-correction that lets it recover from its own mistakes mid-run, and benchmark results where SpatialClaw beats the previous best spatial agent by 11.2 points with no task-specific training. His honest caveat: it's a research-cluster project, not a single-GPU download.
"SpatialClaw - Why Code Is the Right Interface for Spatial AI Agents" by Fahd Mirza
Key Takeaways
- The thesis: capability is bounded by composition, not tools. Two agents with the identical toolset get wildly different answers depending on how they're allowed to combine those tools. Letting the agent write code — instead of dispatching one fixed tool at a time — is what unlocks correct spatial reasoning.
- A persistent Python kernel is the core mechanism. Perception outputs become ordinary variables the agent can manipulate with NumPy, SciPy, or Matplotlib. Every variable created in one cell survives into the next, so the agent builds up state as it reasons.
- Five-stage loop per question: (1) a planning step reads the question and metadata without images and produces a structured plan; (2) the VLM writes a Python cell with a stated purpose, reasoning, and next goal; (3) the cell passes an AST safety check; (4) it executes in the kernel; (5) stdout, errors, variable summaries, and any rendered images feed back as the next observation. The loop repeats until the agent calls return.
- Self-correction is the standout behavior. In the demo the agent segmented objects, visually verified the masks, then realized its centroid method (a median) was wrong for a closest-point distance, and switched to scipy.spatial.KDTree to land the correct 0.94 m answer. Single-pass and rigid JSON-tool baselines both got it wrong.
- Reported result: +11.2 points over the previous best spatial-reasoning agent across 20 benchmarks, with no benchmark-specific tuning and consistent gains across every backbone tested.
- Hardware reality check: Mirza couldn't reproduce it on a single H100 — it expects multiple H100/A100 GPUs plus separate perception servers running SAM 3 and Depth Anything 3. Treat it as a research project, not a consumer-GPU install.
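To make stages 3 to 5 of the loop and the persistent-kernel idea concrete, here is a minimal sketch using only the Python standard library. It is not SpatialClaw's code: the `ToyKernel` class, the `check_cell` helper, and the deny-lists are hypothetical names chosen for illustration, and the source does not say what the real AST check allows or blocks.

```python
import ast
import contextlib
import io
import traceback

# Hypothetical deny-lists; the real safety rules are not described in the source.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}


def check_cell(source: str) -> list[str]:
    """Walk the cell's AST and return a list of safety violations (empty if none)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        problems += [f"blocked import: {n}" for n in names if n in BLOCKED_MODULES]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            problems.append(f"blocked call: {node.func.id}")
    return problems


class ToyKernel:
    """Keeps one shared namespace so variables from earlier cells stay available."""

    def __init__(self):
        self.namespace = {}

    def run_cell(self, source: str) -> dict:
        # Stage 3: static safety check before anything runs.
        problems = check_cell(source)
        if problems:
            return {"stdout": "", "error": "; ".join(problems), "variables": {}}
        # Stage 4: execute in the persistent namespace, capturing stdout.
        buffer = io.StringIO()
        error = None
        try:
            with contextlib.redirect_stdout(buffer):
                exec(compile(source, "<cell>", "exec"), self.namespace)
        except Exception:
            error = traceback.format_exc(limit=1)
        # Stage 5: build the observation (stdout, error, variable summary).
        variables = {
            name: type(value).__name__
            for name, value in self.namespace.items()
            if not name.startswith("__")
        }
        return {"stdout": buffer.getvalue(), "error": error, "variables": variables}


kernel = ToyKernel()
first = kernel.run_cell("points = [(0.0, 0.0), (3.0, 4.0)]")
# The second cell reuses `points`, which was defined by the first cell.
second = kernel.run_cell(
    "dist = ((points[1][0] - points[0][0]) ** 2 + (points[1][1] - points[0][1]) ** 2) ** 0.5\n"
    "print(dist)"
)
rejected = kernel.run_cell("import os")
```

The second cell succeeds only because `points` survived from the first one, which is the property the takeaways describe. A deny-list like this is easy to bypass and is not a sandbox; it only illustrates where a static check sits in the loop.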
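The self-correction step turns on the gap between a centroid-to-centroid distance and a closest-point distance. The sketch below shows that gap on synthetic points; the coordinates are invented for illustration and are unrelated to the demo's scene or its 0.94 m answer. It assumes NumPy and SciPy are installed and uses `scipy.spatial.KDTree`, the class named in the video summary.

```python
import numpy as np
from scipy.spatial import KDTree

# Synthetic 3D points (metres) standing in for two segmented objects.
# Object A is L-shaped: a long arm along x plus a short arm along y.
arm_x = np.column_stack([np.linspace(0.0, 2.0, 21), np.zeros(21), np.zeros(21)])
arm_y = np.column_stack([np.zeros(10), np.linspace(0.1, 1.0, 10), np.zeros(10)])
object_a = np.vstack([arm_x, arm_y])
# Object B is small and sits just past the end of A's long arm.
object_b = np.array([[2.5, 0.0, 0.0], [2.6, 0.0, 0.0], [2.5, 0.1, 0.0]])

# Centroid-style estimate: distance between the per-axis medians of each object.
centroid_gap = np.linalg.norm(
    np.median(object_a, axis=0) - np.median(object_b, axis=0)
)

# Closest-point estimate: for every point in B, find its nearest neighbour in A,
# then take the smallest of those distances.
tree = KDTree(object_a)
nearest_dists, _ = tree.query(object_b)
closest_gap = nearest_dists.min()

print(f"median-centroid distance: {centroid_gap:.2f} m")
print(f"closest-point distance:   {closest_gap:.2f} m")
```

On this toy data the median-based estimate comes out at 2.0 m while the closest-point distance is 0.5 m, because the median of an elongated object sits far from the end that actually approaches its neighbour. That is the kind of mismatch the agent is described as catching mid-run.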
How to run it (as described in the video)
Mirza didn't do a hands-on install (no multi-GPU cluster on hand), so these are the high-level steps he read off the repo rather than commands executed on camera. Check NVIDIA's official repo for exact syntax before running.
# 1. Clone the repo with submodules
# 2. Run the setup script (first run takes ~1–2 hours)
# 3. Configure your API keys, or point it at a self-hosted vLLM instance
# 4. Stand up GPU perception servers (SAM 3 + Depth Anything 3)
# 5. Run an experiment against one of the 20 supported benchmarks





