Published: 2026-06-21
Run a Local Coding Agent: Qwen 3.6 27B (Pi-Reasoning GGUF) in Hermes
Chapters / key moments (click to jump — plays here on the page)
Fahd Mirza installs a Pi-Reasoning fine-tune of Qwen 3.6 27B (Q4_K_M GGUF) with llama.cpp, wires it into Hermes agent, and tests it on bug-fixing, creative coding, and writing. The model uses baked-in multi-token prediction (MTP) plus speculative decoding for speed, fits in ~20GB of VRAM, and held up well on a real full-stack bug fix.
Source video
"Qwen3.6 27B (Pi-Reasoning GGUF) - Fine-Tuned for Local Heavy AI Agent" by Fahd Mirza — Watch on YouTube →
Key Takeaways
- The model is a fine-tune of Qwen 3.6 27B trained on real successful coding-agent sessions (including the step-by-step reasoning, not just final answers) — aimed at agentic tasks: reading files, running terminal commands, writing fixes, self-checking.
- Q4_K_M quantization runs in just over 20GB of VRAM (tested on an RTX A6000), so it fits a 24GB card; drop the context length if you need more headroom.
- Multi-token prediction (MTP) is baked into the weights via extra prediction heads, so the model predicts several tokens per pass; speculative decoding verifies them in one pass (Fahd saw ~82% draft acceptance live).
- It fixed a planted bug in a full-stack app end-to-end through Hermes agent, and stayed coherent on creative coding (an animated procedurally-generated tree) — no quantization loop or hallucination collapse.
- Served via llama.cpp with speculative-decoding + MTP flags, 128k context, flash attention, and a Q4_0 KV cache to save VRAM.
Commands & Code Mentioned
# Serve the GGUF with llama.cpp (flags explained in the video):
llama-server -m qwen3.6-27b-pi-reasoning-Q4_K_M.gguf \
--spec-type draft --draft 3 --draft-max 3 \ # speculative decoding via built-in MTP heads (draft 3 tokens ahead)
-ngl 99 \ # offload all layers to the GPU
-fa \ # flash attention
--cache-type-k q4_0 --cache-type-v q4_0 \ # quantize KV cache to save VRAM
-c 128000 \ # 128k context window
--jinja # proper chat / tool-call formatting





