Published: 2026-06-02

Adaptive PFlash + Hermes Agent: Self-Tuning Prefill on a Single GPU

The PFlash prefill-acceleration layer of the DFlash stack just gained adaptive compression: instead of relying on a manually set keep-ratio parameter, it now watches each Hermes session in real time and tunes itself automatically. Fahd Mirza walks through pulling the latest code, rebuilding the binary, and wiring the DFlash server into Hermes Agent, then shows PFlash compressing 3,572 prefill tokens down to just 148 on a single NVIDIA RTX 6000 GPU.

Source video

"Adaptive PFlash + Hermes Agent - Self-Tuning Prefill on a Single GPU Locally" by Fahd Mirza

Key Takeaways

  • PFlash adaptive mode replaces the manual keep_ratio parameter — it observes actual acceptance rates in real time and adjusts compression automatically per session.
  • Hermes agent is an ideal PFlash target because every turn sends a long system prompt plus full conversation history; PFlash compressed this from 3,572 tokens to 148 in the demo.
  • The DFlash repo structure changed in a recent PR — CMakeLists.txt moved into the server/ directory; existing builds need to be rebuilt from the updated path.
  • Hardware used: Ubuntu, Nvidia RTX 6000 (48 GB VRAM); a small ~6B quantized drafter model handles the token-scoring work for PFlash.
  • A minimum context length must be set in Hermes config when using PFlash — increase it via the set command or edit the config file directly before restarting.
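
To put the demo numbers in perspective, the short calculation below works out what fraction of the prefill tokens survived and the resulting compression factor. Only the two token counts (3,572 and 148) come from the video; the function name is illustrative.

```python
def compression_stats(tokens_in: int, tokens_kept: int) -> tuple[float, float]:
    """Return (fraction of tokens kept, compression factor) for one prefill pass."""
    if tokens_in <= 0 or tokens_kept <= 0:
        raise ValueError("token counts must be positive")
    return tokens_kept / tokens_in, tokens_in / tokens_kept


# Token counts reported in the demo.
kept, factor = compression_stats(3572, 148)
print(f"kept {kept:.1%} of tokens, {factor:.1f}x compression")
# prints: kept 4.1% of tokens, 24.1x compression
```

In other words, the adaptive mode settled on keeping roughly 4% of the prompt for this session, close to the 5% figure quoted for PFlash in general.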

Commands & Code Mentioned

# Pull and rebuild DFlash (run from the lucebox-hub root)
git pull
cd server
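# CMakeLists.txt now lives in server/, so configure from this directory.
# Out-of-source build; the compiled binary ends up under build/
cmake -S . -B build
cmake --build build -j"$(nproc)"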

# Start DFlash server with adaptive PFlash enabled
# (run from the directory that contains the built binary)
./dflash_server --pflash=auto --bsa=on

# Update Hermes config to point at local DFlash server
# Edit .hermes/config — change endpoint to http://localhost:
# and set model to the DFlash drafter model name

# Start Hermes
hermes

# If minimum context length error appears, set it first:
hermes set context_length 4096
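
Before starting Hermes, it can help to confirm that something is actually listening on the local endpoint written into the config. The helper below is a generic TCP check, not part of DFlash or Hermes; the function name is hypothetical, and the port is deliberately left as an argument because the source does not state which port the server uses.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("localhost", port)` with the port your DFlash server was started on returns `False` if the server is not up yet, which is a quicker signal than waiting for Hermes to report a connection error.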

How PFlash Fits the DFlash Stack

DFlash combines two acceleration tracks in one binary:

  • DFlash (decode side): Speculative decoding. A fast draft model proposes 16 tokens at once using block diffusion, and the large model verifies all 16 in a single forward pass. This delivers ~136 tokens/sec on Gemma 4 31B, versus 26 tokens/sec without it.
  • PFlash (prefill side): A small ~6B drafter scores which tokens actually matter during prefill, and the large model processes only the top-scoring 5% that survive. This reduces a 128k-token prefill from ~4 minutes to ~25 seconds.
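
The prefill-side idea can be illustrated with a minimal sketch: given one importance score per prompt token, keep only the highest-scoring fraction and preserve the original token order. This is not PFlash's actual scoring or selection code, which the video does not show; the function name, the scores, and the keep ratio used here are all assumptions made for illustration.

```python
import math


def select_tokens(scores: list[float], keep_ratio: float) -> list[int]:
    """Return indices of the highest-scoring tokens, in original prompt order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not scores:
        return []
    # Always keep at least one token so the large model has something to process.
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Made-up drafter scores for an 8-token prompt.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6]
kept = select_tokens(scores, keep_ratio=0.5)
print(kept)  # prints: [1, 3, 5, 7]
```

Sorting the surviving indices at the end matters: the large model still needs the kept tokens in their original sequence, not in score order.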

The new adaptive mode removes the last manual knob, so no keep_ratio tuning is needed. If compression is hurting quality, the algorithm backs off; if it is being too conservative, it tightens automatically.
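
The back-off and tighten behaviour described above resembles a simple feedback controller. The sketch below is a toy version under stated assumptions, not the algorithm DFlash ships: the class name, the acceptance-rate band, the step size, and the ratio limits are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveKeepRatio:
    """Toy feedback controller; every threshold here is an illustrative assumption."""
    keep_ratio: float = 0.05   # starting fraction of prefill tokens to keep
    low: float = 0.60          # below this acceptance rate, assume compression hurts quality
    high: float = 0.80         # above this, assume there is room to compress harder
    step: float = 1.25         # multiplicative adjustment per update
    min_ratio: float = 0.02
    max_ratio: float = 1.0

    def update(self, acceptance_rate: float) -> float:
        """Adjust keep_ratio from the latest observed acceptance rate and return it."""
        if acceptance_rate < self.low:
            # Quality is suffering: back off by keeping more tokens.
            self.keep_ratio = min(self.max_ratio, self.keep_ratio * self.step)
        elif acceptance_rate > self.high:
            # Too conservative: tighten by keeping fewer tokens.
            self.keep_ratio = max(self.min_ratio, self.keep_ratio / self.step)
        return self.keep_ratio


ctrl = AdaptiveKeepRatio()
for rate in (0.50, 0.55, 0.70, 0.90):
    print(f"acceptance={rate:.2f} -> keep_ratio={ctrl.update(rate):.4f}")
```

With these made-up settings, two low acceptance readings raise the keep ratio, a reading inside the band leaves it unchanged, and a high reading lowers it again, which mirrors the per-session self-tuning the video describes.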
