Published: 2026-06-05

Mellum 2: JetBrains’ 12B MoE Model with MCP Tool Use and Hermes Agent

Name: Mellum 2: JetBrains' 12B MoE Model with MCP Tool Use and Hermes Agent
Uploaded: 2026-06-05
Description: Fahd Mirza runs JetBrains' new Mellum 2 model locally via vLLM, connects it to an MCP filesystem server for real file operations, and tests it inside Hermes Agent.

Chapters / key moments (click to jump — plays here on the page)

Fahd Mirza demonstrates running JetBrains' new Mellum 2 model locally via vLLM, connecting it to an MCP filesystem server for real file read/write operations, and then testing it inside Hermes Agent. Mellum 2 is a 12B mixture-of-experts model released under Apache 2.0 — not a fine-tune, built from scratch — with tool use, multi-step agentic workflows, and a 131k token context window at roughly the compute cost of a 2.5B dense model.

Source video

"Mellum2: JetBrains' New Coding Model - vLLM + MCP Tool Use Locally" by Fahd Mirza — Watch on YouTube →

Key Takeaways

Mellum 2 is JetBrains' 12B mixture-of-experts model — Apache 2.0 licensed, built from scratch (not a fine-tune), with native tool use and a 131k token context window.
MoE architecture: 3 of 4 attention layers use sliding window attention (1024 tokens) for speed; the 4th uses full attention to preserve long-range context. Total compute cost is equivalent to a 2.5B dense model.
A thinking variant exposes chain-of-thought inside <think> tags; vLLM uses the Qwen 3 reasoning parser to surface it as a separate API field.
Connect to MCP via a simple JSON config + MCP CLI — Mellum 2 called tools autonomously (directory listing, file read, file write) without explicit prompting to do so.
Works inside Hermes Agent by selecting "local/custom endpoint" and pointing it at the vLLM server on localhost:8000.
Verdict: solid for code refactoring and bug finding; not the strongest single-shot code generator for highly constrained multi-requirement tasks.

Commands & Setup Steps

# 1. Serve Mellum 2 locally with vLLM (requires NVIDIA GPU with ~45GB VRAM)
#    Use the thinking variant from Hugging Face
vllm serve JetBrains/Mellum-2-12B-instruct-thinking \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --port 8000

# 2. Install latest Transformers if model fails to load
pip install git+https://github.com/huggingface/transformers

# 3. Create MCP filesystem server config (mcp-server-config.json)
# Example config telling MCP CLI where the sandbox lives and how to start the server:
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/sandbox"]
    }
  }
}

# 4. Install MCP CLI and connect to Mellum 2
pip install mcp-cli
mcp chat --model http://localhost:8000/v1 --config mcp-server-config.json

# 5. Use Mellum 2 inside Hermes Agent
#    In Hermes setup: select "local" or "custom endpoint" as provider
#    Set model endpoint to http://localhost:8000
#    Hermes will auto-detect the Mellum 2 model

Mellum 2: JetBrains’ 12B MoE Model with MCP Tool Use and Hermes Agent

Key Takeaways

Commands & Setup Steps

More Hermes news

Go deeper: Hermes guides