Deep dive
SkillOpt: Train a Markdown Skill File Locally (No Fine-Tuning)
SkillOpt from Microsoft Research treats a markdown skill document as the trainable "weights" — improving it with a rollout → reflect → gate loop while the model itself never changes. This walkthrough installs SkillOpt, serves a local Qwen 3.5 4B model with vLLM, and trains a skill end-to-end on the ALFWorld household-task benchmark.
"SkillOpt: Microsoft's New Way to 'Train' AI Agents: Run Locally" by Fahd Mirza — Watch on YouTube →
What SkillOpt actually does
Instead of fine-tuning a model's weights, SkillOpt optimizes a plain-text skill document — a markdown file — using the same training vocabulary you'd use for a neural net: epochs, batch size, a learning-rate-like edit budget, and validation gates. The base model is frozen the whole time. What improves is the instructions it's given. A "skill" here is any capability you can describe in markdown (e.g. "handle Excel files this specific way for my business"), and that document plays the role of the weights.
Step-by-Step Breakdown
-
Understand the training loop before touching anything
Each step starts with a rollout: the target model (the one being improved) runs a batch of tasks using the current skill document as context. Then it reflects — a second model, the optimizer, looks at what went wrong and proposes patch edits to the document (the "backward pass"). Patches are aggregated to remove duplicates, then selected and clipped down to a budget — that edit budget is the effective "learning rate." The candidate skill is applied, then comes the gate: the candidate is scored on a held-out validation split and only kept if it beats the current skill, otherwise it's discarded. At the end of each epoch two extras kick in — a slow update (momentum across epochs, so the skill doesn't forget earlier learning) and a meta-skill (the optimizer's own memory of what kinds of edits tend to help).
-
Hardware and the local model
Demo runs on an Ubuntu box with a single NVIDIA RTX A6000 (48 GB VRAM). The target model is Qwen 3.5 4B, downloaded locally first.
-
Serve the model with vLLM
The local Qwen 3.5 4B is served through vLLM as an OpenAI-compatible endpoint. With the KV cache loaded, VRAM usage sits just over 44 GB.
-
Set the environment variables
Point SkillOpt at the model with a base URL (the local vLLM endpoint) and the model name. The optimizer model doesn't have to be the same one — any OpenAI-compatible model works: the same local model, OpenAI, Anthropic's API, or Azure. If you keep the same model, you just point the endpoint at it and flag it OpenAI-compatible.
-
Configure and verify SkillOpt
Configure the SkillOpt run (which model is the target, which is the optimizer) and verify the installation succeeded before training.
-
Pick a benchmark — ALFWorld
The test bench is ALFWorld (ALFRED): a text-based simulated home environment where an agent completes household tasks like "put a clean apple in the fridge" by navigating rooms and interacting with objects through text commands. Install it and download its data, then set the data path.
-
Clone the repo and launch training
Git-clone the SkillOpt repo, then launch the training loop on the ALFRED benchmark. The demo keeps it tiny — one epoch and a batch of four tasks — so a full loop completes quickly: the local Qwen model attempts the household tasks, and the optimizer model reviews those attempts and writes better instructions into the skill document.
-
Inspect the packaged skill file
The output is a single markdown skill document, organized by task type (pick-and-place, clean-and-place, heat-and-place, …), followed by general principles (e.g. "search each location only once," "grab visible objects immediately"). The most interesting part is a "hard search loop" — explicit rules that stop the agent getting stuck rechecking the same drawer or cabinet, the most common failure mode for these agents. Inside a protected slow-update block there's even more specific guidance learned across epochs, task by task (tips for items like pens, kettles, and sponges).
Common Errors & Fixes Covered
Why it happens: the demo's first launch failed because the system ran out of disk space during the run.
Fix: free up disk space and rerun the launch command — the loop completed on the second attempt.
Gotchas & Caveats
- The optimizer model must be OpenAI-compatible — a local vLLM endpoint, OpenAI, Anthropic's API, or Azure all work; you can even reuse the target model as its own optimizer.
- "Learning rate" here is not a numeric LR — it's the maximum number of edits allowed per step (the clip/budget on aggregated patches).
- A candidate skill is only kept if it beats the current one on a held-out validation split; failing candidates are rejected and discarded, so the skill only moves forward.
- The approach generalizes to any task with a measurable outcome — Excel formulas, legal letters, whatever — not just agent benchmarks.
- Running a 4B model under vLLM with KV cache already consumes ~44 GB VRAM here; budget your GPU accordingly, or point the target at a hosted endpoint instead.
Key Takeaways
- SkillOpt (Microsoft Research) optimizes a markdown skill document instead of fine-tuning weights — the base model never changes, only its instructions do.
- The loop mirrors neural-net training: epochs, batch size, an edit-budget "learning rate," momentum, and validation gates.
- A second "optimizer" model proposes patches to the skill doc (the backward pass); a validation gate keeps only genuine improvements.
- The walkthrough trains a local Qwen 3.5 4B (served via vLLM on an RTX A6000) on the ALFWorld household-task benchmark.
- The resulting file reads like a learned checklist — task-type sections, general principles, and anti-loop rules — which is exactly the shape of a good hand-written skill or instruction file.





