Speed Up OpenClaw 2–3x With DFlash Speculative Decoding on Local GPU
DFlash is a speculative decoding inference engine that uses block diffusion—proposing entire token blocks at once rather than one token at a time—to deliver 2–3x faster generation on the same GPU hardware. Fahd Mirza shows how to serve DFlash as an OpenAI-compatible endpoint on port 8080 and point OpenClaw at it as a custom provider. With tool calling now supported, DFlash can back any agentic harness including OpenClaw, Hermes Agent, and Codex—completely locally with no API costs.
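To make the mechanism concrete, here is a toy Python sketch of block-wise speculative decoding in general. This is not DFlash's actual code: draft_block and target_greedy are hypothetical stand-ins for the block-diffusion proposer and the target model's single verification pass.

```python
# Toy illustration of block-wise speculative decoding (NOT DFlash's real code).
# draft_block and target_greedy are hypothetical stand-ins for the
# block-diffusion proposer and the target model's verification pass.

def draft_block(ctx, k):
    """Hypothetical draft step: propose a block of k tokens at once
    (this is where DFlash's block diffusion would run)."""
    return [(ctx[-1] + i + 1) % 50_000 for i in range(k)]  # dummy token ids

def target_greedy(ctx, proposal):
    """Hypothetical target step: ONE forward pass scores every proposed
    position and returns the target's own greedy token at each one."""
    return list(proposal)  # dummy: target happens to agree everywhere

def speculative_step(ctx, k=8):            # k mirrors --speculation-budget 8
    proposal = draft_block(ctx, k)         # cheap: one block proposal
    checks = target_greedy(ctx, proposal)  # expensive: one target pass total
    accepted = []
    for prop, tgt in zip(proposal, checks):
        accepted.append(tgt)               # the target's token is always safe to keep
        if prop != tgt:                    # first mismatch ends the accepted prefix
            break
    return ctx + accepted                  # up to k tokens for one target pass

print(speculative_step([1, 2, 3], k=8))
```

When draft and target agree often, each expensive target pass yields several tokens instead of one, which is where the 2–3x figure comes from.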
"Luce DFlash Meets OpenClaw - Local AI Agents at 2x Speed with Qwen3.6-27B" by Fahd Mirza — Watch on YouTube →
Key Takeaways
- DFlash serves an OpenAI-compatible API on port 8080: point OpenClaw's custom provider URL to http://localhost:8080 with no API key required.
- 2–3x speed gain over standard autoregressive inference on the same hardware using block diffusion speculative decoding.
- Tool calling now supported—Hermes Agent and Codex can also use DFlash as their local backend.
- 65k token context fits in ~20GB VRAM using 3-bit KV cache compression (TQ3_0 flag) with a speculation budget of 8; a rough memory sketch follows this list.
- Setup: clone DFlash repo → conda env → build → serve with KV cache flags → install OpenClaw → set custom provider to localhost:8080.
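As a sanity check on the ~20GB figure, here is a back-of-envelope VRAM estimate. Every hyperparameter below is an assumption (the video states only the totals): a ~27B-parameter model with 48 layers, 8 grouped-query KV heads, head dimension 128, ~4.5 bits per weight after quantization overhead, and ~3.5 bits per KV element to account for TQ3_0 scales.

```python
# All numbers below are ASSUMED for illustration; the video gives only the totals.
params = 27e9                 # ~27B parameters (from the model name)
weight_bits = 4.5             # assumed ~4-bit weight quant incl. overhead
weights_gb = params * weight_bits / 8 / 1e9

layers, kv_heads, head_dim = 48, 8, 128   # assumed architecture (GQA)
ctx = 65_000                              # context from --ctx 65000
kv_bits = 3.5                             # TQ3_0 ~3 bits + scale overhead (assumed)
kv_elems_per_tok = 2 * layers * kv_heads * head_dim   # K and V per token
kv_gb = ctx * kv_elems_per_tok * kv_bits / 8 / 1e9

print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB "
      f"= ~{weights_gb + kv_gb:.0f} GB before activations and runtime overhead")
```

With these assumed values the total lands near 18 GB, consistent with the ~20GB claim once activations and runtime overhead are added.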
Commands & Code Mentioned
```bash
git clone https://github.com/dflash-ai/dflash
conda create -n dflash python=3.10 && conda activate dflash

# Build DFlash (see repo README for full steps)

# Serve on port 8080 with 3-bit KV cache:
DFLASH_KV_TYPE=TQ3_0 DFLASH_PREFILL_UBATCH=512 \
  dflash serve --model luce-dflash --ctx 65000 \
  --speculation-budget 8 --port 8080

# Install OpenClaw, then configure custom provider:
openclaw setup --provider custom --base-url http://localhost:8080
```
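Once the server is up, a quick smoke test is possible with the official openai Python client, assuming DFlash mirrors the standard OpenAI route layout under /v1 (the video only says "OpenAI-compatible"). The model name luce-dflash comes from the serve command above; the calculator tool is a made-up definition to exercise the newly added tool calling.

```python
# Smoke test for the local DFlash endpoint. Assumes standard /v1 routes;
# the calculator tool is a hypothetical example, not part of DFlash.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # DFlash server started above
    api_key="not-needed",                 # no API key required locally
)

resp = client.chat.completions.create(
    model="luce-dflash",                  # name passed to `dflash serve`
    messages=[{"role": "user", "content": "What is 17 * 23? Use the calculator."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate an arithmetic expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
)

msg = resp.choices[0].message
print(msg.tool_calls or msg.content)  # tool call if the model chose to use it
```

If the model emits a tool call, OpenClaw (or any other harness) executes it and sends the result back in a follow-up message; the request shape is the same as against any OpenAI-compatible server.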