Speed Up OpenClaw 2–3x With DFlash Speculative Decoding on Local GPU
DFlash is a speculative decoding inference engine that uses block diffusion—proposing entire token blocks at once rather than one token at a time—to deliver 2–3x faster generation on the same GPU hardware. Fahd Mirza shows how to serve DFlash as an OpenAI-compatible endpoint on port 8080 and point OpenClaw at it as a custom provider. With tool calling now supported, DFlash can back any agentic harness including OpenClaw, Hermes Agent, and Codex—completely locally with no API costs.
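To make the mechanism concrete, here is a toy Python sketch of block-wise speculative decoding in general. This is not DFlash's actual code: draft_block and target_greedy are hypothetical stand-ins for the block-diffusion proposer and the target model's single verification pass.

```python
# Toy illustration of block-wise speculative decoding (NOT DFlash's real code).
# draft_block and target_greedy are hypothetical stand-ins for the
# block-diffusion proposer and the target model's verification pass.

def draft_block(ctx, k):
    """Hypothetical draft step: propose a block of k tokens at once
    (this is where DFlash's block diffusion would run)."""
    return [(ctx[-1] + i + 1) % 50_000 for i in range(k)]  # dummy token ids

def target_greedy(ctx, proposal):
    """Hypothetical target step: ONE forward pass scores every proposed
    position and returns the target's own greedy token at each one."""
    return list(proposal)  # dummy: target happens to agree everywhere

def speculative_step(ctx, k=8):            # k mirrors --speculation-budget 8
    proposal = draft_block(ctx, k)         # cheap: one block proposal
    checks = target_greedy(ctx, proposal)  # expensive: one target pass total
    accepted = []
    for prop, tgt in zip(proposal, checks):
        accepted.append(tgt)               # the target's token is always safe to keep
        if prop != tgt:                    # first mismatch ends the accepted prefix
            break
    return ctx + accepted                  # up to k tokens for one target pass

print(speculative_step([1, 2, 3], k=8))
```

When draft and target agree often, each expensive target pass yields several tokens instead of one, which is where the 2–3x figure comes from.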
"Luce DFlash Meets OpenClaw - Local AI Agents at 2x Speed with Qwen3.6-27B" by Fahd Mirza — Watch on YouTube →
Key Takeaways
- DFlash serves an OpenAI-compatible API on port 8080: point OpenClaw's custom provider URL to http://localhost:8080 with no API key required.
- 2–3x speed gain over standard autoregressive inference on the same hardware using block diffusion speculative decoding.
- Tool calling now supported—Hermes Agent and Codex can also use DFlash as their local backend.
- 65k token context fits in ~20GB VRAM using 3-bit KV cache compression (TQ3_0 flag) with a speculation budget of 8; a rough memory sketch follows this list.
- Setup: clone DFlash repo → conda env → build → serve with KV cache flags → install OpenClaw → set custom provider to localhost:8080.
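As a sanity check on the ~20GB figure, here is a back-of-envelope VRAM estimate. Every hyperparameter below is an assumption (the video states only the totals): a ~27B-parameter model with 48 layers, 8 grouped-query KV heads, head dimension 128, ~4.5 bits per weight after quantization overhead, and ~3.5 bits per KV element to account for TQ3_0 scales.

```python
# All numbers below are ASSUMED for illustration; the video gives only the totals.
params = 27e9                 # ~27B parameters (from the model name)
weight_bits = 4.5             # assumed ~4-bit weight quant incl. overhead
weights_gb = params * weight_bits / 8 / 1e9

layers, kv_heads, head_dim = 48, 8, 128   # assumed architecture (GQA)
ctx = 65_000                              # context from --ctx 65000
kv_bits = 3.5                             # TQ3_0 ~3 bits + scale overhead (assumed)
kv_elems_per_tok = 2 * layers * kv_heads * head_dim   # K and V per token
kv_gb = ctx * kv_elems_per_tok * kv_bits / 8 / 1e9

print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB "
      f"= ~{weights_gb + kv_gb:.0f} GB before activations and runtime overhead")
```

With these assumed values the total lands near 18 GB, consistent with the ~20GB claim once activations and runtime overhead are added.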
Commands & Code Mentioned
```bash
git clone https://github.com/dflash-ai/dflash
conda create -n dflash python=3.10 && conda activate dflash

# Build DFlash (see repo README for full steps)

# Serve on port 8080 with 3-bit KV cache:
DFLASH_KV_TYPE=TQ3_0 DFLASH_PREFILL_UBATCH=512 \
  dflash serve --model luce-dflash --ctx 65000 \
  --speculation-budget 8 --port 8080

# Install OpenClaw, then configure custom provider:
openclaw setup --provider custom --base-url http://localhost:8080
```
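Once the server is up, a quick smoke test is possible with the official openai Python client, assuming DFlash mirrors the standard OpenAI route layout under /v1 (the video only says "OpenAI-compatible"). The model name luce-dflash comes from the serve command above; the calculator tool is a made-up definition to exercise the newly added tool calling.

```python
# Smoke test for the local DFlash endpoint. Assumes standard /v1 routes;
# the calculator tool is a hypothetical example, not part of DFlash.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # DFlash server started above
    api_key="not-needed",                 # no API key required locally
)

resp = client.chat.completions.create(
    model="luce-dflash",                  # name passed to `dflash serve`
    messages=[{"role": "user", "content": "What is 17 * 23? Use the calculator."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate an arithmetic expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
)

msg = resp.choices[0].message
print(msg.tool_calls or msg.content)  # tool call if the model chose to use it
```

If the model emits a tool call, OpenClaw (or any other harness) executes it and sends the result back in a follow-up message; the request shape is the same as against any OpenAI-compatible server.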