Local GPU Inference Setup — CUDA, Nemotron & VRAM Requirements
Most NemoClaw users connect to Claude or OpenAI. But if you have an NVIDIA GPU — whether in a local workstation or a GPU cloud instance — you can run inference entirely on your own hardware. No API costs, no data leaving your server, and first-token latency measured in milliseconds rather than seconds. This guide covers everything from driver installation to pointing NemoClaw at your own GPU.
GPU inference is optional. A $10/month Hostinger VPS with Claude or OpenAI as the provider works great and costs less per month than a gaming GPU. Come back to this guide when you have hardware ready, or when your API bill gets large enough that local inference makes financial sense.
Why Local Inference?
| Reason | Details |
|---|---|
| Privacy | Nothing leaves your machine — no prompts, no responses sent to a third-party API |
| Cost | GPU electricity cost is ~$0.02–0.05/hour; Opus API can cost $1+/hour under heavy use |
| Latency | Local 7B models return first token in <200ms; API models vary 500ms–3s |
| Rate limits | No provider rate limits — run as many requests as your GPU can handle |
| Offline use | Works without internet (once models are downloaded) |
The trade-off: local models under 14B parameters are noticeably less capable than Claude Sonnet or GPT-4.1 for complex reasoning. For simple automation tasks (heartbeats, triage, summaries) they're excellent. For complex multi-step reasoning, route to a cloud model as a fallback.
Step 1 — Check GPU Compatibility
Local inference with NemoClaw requires an NVIDIA GPU with CUDA compute capability 7.0 or higher (Volta architecture or newer). Here's the VRAM requirement by use case:
| VRAM | Suitable GPU examples | Models that fit | Quality tier |
|---|---|---|---|
| 8 GB | RTX 3070, RTX 4060 Ti | Llama 3.2 3B (full), Qwen 2.5 7B (4-bit quant) | Good for simple tasks |
| 16 GB | RTX 4080, RTX 4080 Super, A4000 | Qwen 2.5 14B (4-bit), Nemotron-Mini (full) | Solid assistant quality |
| 24 GB | RTX 3090, RTX 4090, A5000 | Qwen 2.5 32B (4-bit), Nemotron-4-Mini (full), Llama 3.3 70B (2-bit) | Near-API quality |
| 40–80 GB | A100, H100, L40S | Llama 3.3 70B (full), Nemotron-4-340B (multi-GPU only) | Approaches Claude Sonnet |
| Multi-GPU | 2× H100, 4× A100 | Nemotron-4-340B (full precision), frontier models | Top tier |
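The fit estimates above follow a back-of-envelope rule: weight memory plus roughly 20% overhead for KV cache and activations. A sketch of that arithmetic (the 1.2 overhead factor is an assumption; real usage grows with context length):

```shell
# Rule of thumb: VRAM (GB) ~= params (billions) x bytes per weight x 1.2.
# Bytes per weight: 2 (FP16), 1 (8-bit), 0.5 (4-bit quantisation).
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b * 1.2 }'
}

estimate_vram_gb 14 0.5   # Qwen 2.5 14B at 4-bit -> prints 8.4
estimate_vram_gb 70 2     # Llama 3.3 70B at FP16 -> prints 168.0
```

The 14B 4-bit figure lands comfortably under 16 GB, matching the table; the FP16 70B figure shows why that model needs data-centre cards.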
Check your GPU's compute capability:
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
# Example output:
# NVIDIA GeForce RTX 4090, 8.9, 24564 MiB
Compute capability 8.9 (RTX 4090) is well above the 7.0 minimum. If nvidia-smi isn't found, the driver isn't installed yet; continue with Step 2.
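If you provision machines with scripts, you can gate on the 7.0 minimum directly. A sketch (`meets_min_cap` is a hypothetical helper name):

```shell
# Return success only when the reported compute capability is >= 7.0.
meets_min_cap() {
  awk -v c="$1" 'BEGIN { exit !(c >= 7.0) }'
}

cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -1)
if meets_min_cap "${cap:-0}"; then
  echo "OK: compute capability $cap meets the 7.0 minimum"
else
  echo "unsupported or no GPU detected (compute capability: ${cap:-none})" >&2
fi
```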
Step 2 — Install NVIDIA Drivers
On Ubuntu 22.04 or 24.04 (the most common VPS and workstation OS for this):
# Add the NVIDIA driver PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
# Install the recommended driver (545 or 550 as of 2026)
sudo apt install nvidia-driver-550
# Or let Ubuntu pick the right driver:
sudo ubuntu-drivers autoinstall
# Reboot required after driver install
sudo reboot
After reboot, verify:
nvidia-smi
# Should show driver version, CUDA version, and your GPU name
# Example:
# Driver Version: 550.54.15 CUDA Version: 12.4
The CUDA version shown by nvidia-smi is the maximum supported CUDA runtime version — not what's installed. Install CUDA Toolkit 12.4 explicitly in Step 3. Don't assume it's there just because nvidia-smi shows a CUDA version.
Step 3 — Install CUDA Toolkit
Download from developer.nvidia.com/cuda-downloads. Select Linux → x86_64 → Ubuntu → 22.04 → deb (network):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-4
# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify
nvcc --version
# Should show: Cuda compilation tools, release 12.4
Step 4 — Install Ollama with GPU Support
Ollama is the easiest way to run open models with GPU acceleration. It auto-detects CUDA at install time:
curl -fsSL https://ollama.com/install.sh | sh
# Verify the install: run a small model as a quick test
ollama run llama3.2:3b "Hello"
Confirm GPU usage directly:
# In one terminal, start a model
ollama run qwen2.5:7b "Describe yourself"
# In another terminal, watch GPU utilisation
watch -n 0.5 nvidia-smi
# GPU-Util should jump to 80-100% during inference
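Ollama can also report where a loaded model's weights ended up: `ollama ps` lists running models with a PROCESSOR column, and a healthy setup shows the model fully on the GPU:

```shell
# With a model loaded, ask Ollama where it placed the weights.
# A correct GPU setup reports "100% GPU" in the PROCESSOR column;
# "CPU" or a split like "40%/60% CPU/GPU" means spillover.
ollama ps 2>/dev/null || echo "ollama not running"
```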
Recommended Models by VRAM
# 8 GB VRAM — fast, lightweight
ollama pull llama3.2:3b
ollama pull qwen2.5:7b # needs 4-bit quant at 8 GB
# 16 GB VRAM — good quality
ollama pull qwen2.5:14b
ollama pull mistral-small:22b # 4-bit fits 16 GB
# 24 GB VRAM — near-API quality
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b # caution: 4-bit needs ~45 GB and spills to CPU at 24 GB
Step 5 — Nemotron via NVIDIA NIM
For NVIDIA's Nemotron models specifically, NVIDIA Inference Microservices (NIM) provides optimised containers with better performance than raw Ollama. NIM requires a free NVIDIA developer account:
# Authenticate with NGC (NVIDIA GPU Cloud)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key from build.nvidia.com>
# Pull and run Nemotron Mini (4.1B — fits in 8 GB VRAM)
docker run --gpus all -p 8000:8000 \
  nvcr.io/nvidia/nemotron-mini-4b-instruct:latest
# Or Nemotron-4-Mini-Instruct (8B — needs 16 GB)
docker run --gpus all -p 8000:8000 \
  nvcr.io/nvidia/nemotron-4-mini-instruct:latest
NIM serves an OpenAI-compatible API on port 8000. This means you can point NemoClaw at it using the OpenAI provider type with a local URL:
# Add NIM as a provider in OpenShell
openShell provider add \
  --name nemotron-local \
  --type openai-compatible \
  --base-url http://localhost:8000/v1 \
  --model nemotron-mini-4b-instruct
# Route inference through this provider
openShell inference route set --provider nemotron-local
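Because the NIM endpoint speaks the standard OpenAI chat-completions shape, you can smoke-test it with plain curl before touching any NemoClaw config. A sketch (the model name must match what your container actually serves):

```shell
# Standard OpenAI-style chat-completions request against the local NIM port.
payload='{"model": "nemotron-mini-4b-instruct", "messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'

# A JSON response containing a "choices" array confirms the endpoint is up.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" \
  || echo "endpoint not reachable: is the NIM container running?" >&2
```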
Use NIM for Nemotron models — it's optimised specifically for them and delivers higher token throughput. Use Ollama for everything else (Llama, Qwen, Mistral) — it has a larger model library and simpler management. Both serve OpenAI-compatible APIs and work identically from NemoClaw's perspective.
Step 6 — Configure NemoClaw for Local Inference
Once Ollama or NIM is running on the host, configure NemoClaw to use it. Remember: this config lives inside the sandbox (after claw connect <sandbox-name>):
# Step 1: Connect to the sandbox
claw connect nemoclaw
# Step 2: Add the local policy rule so the sandbox can reach localhost
# (exit sandbox first, add rule, reload, re-enter)
exit
cat >> ~/.openShell/policies/includes/local-inference.yaml << 'EOF'
allow:
  - host: "localhost"
    ports: [11434]  # Ollama default port
    comment: "Local Ollama inference"
  - host: "127.0.0.1"
    ports: [11434, 8000]  # Ollama + NIM
    comment: "Local inference endpoints"
EOF
openShell policy reload
# Step 3: Re-enter sandbox and configure OpenClaw
claw connect nemoclaw
# Step 4: Add Ollama as a provider inside the sandbox config
openclaw config set agents.defaults.model.primary "ollama/qwen2.5:14b"
openclaw config set agents.defaults.models '{"ollama/qwen2.5:14b":{"alias":"Local Qwen 14B"},"anthropic/claude-haiku-4-5":{"alias":"Haiku (cloud fallback)"}}'
# Step 5: Restart the gateway
openclaw gateway restart
Test it:
# Inside the sandbox
openclaw run "What model are you running on?"
# Should respond mentioning qwen or the local model name
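If that fails, it helps to hit Ollama's native API directly, which tells you whether the problem is Ollama itself or the NemoClaw-to-Ollama wiring. A sketch (the model name is assumed to be one you've pulled):

```shell
# Ollama's native generate endpoint; "stream": false returns one JSON object.
payload='{"model": "qwen2.5:14b", "prompt": "Reply with one word: ready", "stream": false}'

curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "Ollama not reachable on 11434: check that the service is running" >&2
```

If this responds but NemoClaw doesn't, the problem is almost certainly the OpenShell policy rule rather than Ollama.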
Performance Expectations
| GPU | Model | Tokens/sec (output) | Notes |
|---|---|---|---|
| RTX 4090 (24 GB) | Qwen 2.5 14B (full) | ~80–100 tok/s | Fast — chat feels instant |
| RTX 4090 (24 GB) | Qwen 2.5 32B (4-bit) | ~40–50 tok/s | Good — slight pause on long outputs |
| RTX 4080 (16 GB) | Qwen 2.5 14B (4-bit) | ~60–75 tok/s | Good — nearly instant |
| RTX 3080 (10 GB) | Llama 3.2 3B (full) | ~120 tok/s | Very fast but limited capability |
| A100 (80 GB) | Llama 3.3 70B (full) | ~50–65 tok/s | Near-API quality at full speed |
| CPU only (no GPU) | Llama 3.2 3B | ~5–15 tok/s | Usable for background tasks only |
Numbers are approximate and vary by system RAM bandwidth, power mode, and temperature throttling.
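To measure your own numbers rather than trust the table, `ollama run --verbose` prints timing stats (prompt eval rate, eval rate) after each response. A small helper can pull out the output tokens/sec figure (`parse_eval_rate` is a hypothetical name):

```shell
# Extract the output tokens/sec from ollama's --verbose timing lines.
# Anchoring on "^eval rate" skips the separate "prompt eval rate" line.
parse_eval_rate() {
  grep '^eval rate' | awk '{ print $3 }'
}

ollama run qwen2.5:14b "Write a haiku about GPUs." --verbose 2>&1 | parse_eval_rate
```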
Troubleshooting
| Problem | Solution |
|---|---|
| CUDA not found / nvcc: not found | CUDA Toolkit not installed or not on PATH. Re-check Step 3 and verify nvcc --version after sourcing .bashrc |
| Ollama shows CPU inference (no GPU) | Check the Ollama server logs (journalctl -u ollama on systemd installs) for CUDA library loading at startup. If CUDA never loads, reinstall Ollama after CUDA is confirmed working |
| Out of memory (OOM) error | Model doesn't fit in VRAM. Pull a smaller model or use a quantized version (e.g. qwen2.5:14b-q4_K_M) |
| NemoClaw can't reach Ollama | Missing policy rule. Add localhost:11434 to your OpenShell policy and reload |
| Driver/CUDA version conflict | Run sudo apt install --reinstall nvidia-driver-550 cuda-toolkit-12-4 and reboot |
| nvidia-smi works but inference uses CPU | Check that CUDA libraries are on LD_LIBRARY_PATH. Run: ldconfig -p \| grep libcuda — should show paths |
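Many of the rows above reduce to "is the component present and listening". A one-shot report script, as a sketch (`gpu_stack_report` is a hypothetical name; the port checks use bash's /dev/tcp):

```shell
# Health report for the local inference stack: driver, toolkit,
# Ollama binary, and the two local API ports.
gpu_stack_report() {
  command -v nvidia-smi >/dev/null 2>&1 && echo "driver: ok"  || echo "driver: MISSING"
  command -v nvcc       >/dev/null 2>&1 && echo "toolkit: ok" || echo "toolkit: MISSING"
  command -v ollama     >/dev/null 2>&1 && echo "ollama: ok"  || echo "ollama: MISSING"
  for port in 11434 8000; do
    (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null \
      && echo "port $port: listening" || echo "port $port: closed"
  done
}
gpu_stack_report
```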
← Back to NemoClaw hub · See also: Switching Model Providers · Cost Optimisation Guide