Published: 2026-06-23

Sakana Fugu Ultra vs Claude Opus 4.8: 38-Task Battle Test

Name: Sakana Fugu Ultra vs Claude Opus 4.8: 38-Task Battle Test
Uploaded: 2026-06-23
Description: Nate Herk tests Sakana's viral Fugu Ultra against Claude Opus 4.8 across 38 tasks. Result: 36 ties, Fugu ~4.5x slower and ~5x more expensive.


Nate Herk takes Sakana's viral Fugu Ultra — a single API that orchestrates frontier models (Opus, GPT, Gemini) like a multi-agent router — and runs it head-to-head against Claude Opus 4.8 across 38 tasks. The result: 36 ties, with Fugu roughly 4.5× slower and 5× more expensive, largely because Opus is one of the very models Fugu delegates to.
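To make the "multi-agent router" idea concrete, here is a minimal sketch of the split, route, and merge pattern described above. It is an illustration only, not Sakana's implementation: the names `plan`, `orchestrate`, `SubTask`, the keyword-based routing rule, and the stand-in worker functions are all hypothetical, and no real model API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in: a "worker" is any function from prompt to text.
# In a real system each worker would wrap a call to a frontier model.
Worker = Callable[[str], str]


@dataclass
class SubTask:
    kind: str    # illustrative label used for routing, e.g. "code" or "writing"
    prompt: str


def plan(task: str) -> List[SubTask]:
    """Toy 'manager': split the task on ';' and tag each piece by keyword."""
    subtasks = []
    for piece in (p.strip() for p in task.split(";")):
        if not piece:
            continue
        kind = "code" if "implement" in piece.lower() else "writing"
        subtasks.append(SubTask(kind, piece))
    return subtasks


def orchestrate(task: str,
                workers: Dict[str, Worker],
                merge: Callable[[List[str]], str]) -> str:
    """Break the task down, route each sub-task to a worker, merge the results."""
    partials = [workers[s.kind](s.prompt) for s in plan(task)]
    return merge(partials)


# Assumed demo workers that only echo their input with a tag.
demo_workers: Dict[str, Worker] = {
    "code": lambda p: f"[coder] {p}",
    "writing": lambda p: f"[writer] {p}",
}

result = orchestrate(
    "Implement a parser; Summarize the design",
    demo_workers,
    merge="\n".join,
)
print(result)
```

Every sub-task adds at least one extra model call on top of the planning and merging steps, which is one plausible reason an orchestrator of this shape is slower and costlier than a single direct call.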

Source video

"I Battle Tested Sakana Fugu's Fable Killer" by Nate Herk (YouTube)

Key Takeaways

Fugu is not a new LLM. It's a small "manager" model that breaks a task down and routes sub-tasks to frontier models (Opus, GPT-5.5, Gemini, and others), then has another model merge the results — a multi-agent system delivered as one API.
It runs inside Claude Code via a markdown config file plus an API key. Notably, Claude Code's context usage stays near zero through a long session because responses are routed through Fugu's server rather than filling Claude Code's own context window.
The scoreboard: across 38 AI-generated, Codex-graded, mostly pass/fail tasks (puzzles, traps, specs, heavy algorithms), 36 ended in ties and Opus won 2. Fugu never clearly won — unsurprising, since Opus 4.8 is one of the models Fugu itself selects from.
Cost and speed are the story: Fugu's runs took 357 minutes total vs Opus's 80 minutes, and cost ~$50 vs ~$10 — about 4.5× slower and 5× pricier. Easy tasks Opus answered in ~6 seconds took Fugu several minutes.
The pattern isn't new. It's the same orchestration you already do when pairing Claude Code sub-agents or running Codex and Claude Code on one codebase; Fugu just automates the delegation. It differs from OpenRouter's Fusion API, which fans the same prompt out to three models and then judges and merges the answers rather than splitting the task.
Honest takeaway: impressive benchmarks, but for knowledge work the cost and latency aren't worth it over a Claude Code or Codex subscription. The real value is for heavy, multi-team software development — and the broader skill of optimizing which model does which task is only getting more important.
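The headline ratios can be checked directly from the totals quoted in the takeaways (357 vs 80 minutes, roughly $50 vs $10, 38 tasks). The short sketch below does that arithmetic; the variable names are illustrative, the dollar figures are the approximate values from the video, and the per-task averages are derived here rather than reported in the source.

```python
# Totals as quoted in the takeaways (costs are approximate).
TASKS = 38
fugu_minutes, opus_minutes = 357, 80
fugu_cost_usd, opus_cost_usd = 50.0, 10.0

# How many times slower and more expensive Fugu was overall.
slowdown = fugu_minutes / opus_minutes
cost_multiple = fugu_cost_usd / opus_cost_usd

# Derived averages per task (not stated in the source).
fugu_min_per_task = fugu_minutes / TASKS
opus_min_per_task = opus_minutes / TASKS

print(f"Slowdown: {slowdown:.1f}x, cost multiple: {cost_multiple:.1f}x")
print(f"Average minutes per task: Fugu {fugu_min_per_task:.1f}, Opus {opus_min_per_task:.1f}")
```

The averages hide a wide spread: per the video, easy tasks that Opus answered in about 6 seconds took Fugu several minutes.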
