Local GPU Inference Setup — CUDA, Nemotron & VRAM Requirements
Most NemoClaw users connect to Claude or OpenAI. But if you have an NVIDIA GPU — whether in a local workstation or a GPU cloud instance — you can run inference entirely on your own hardware. No API costs, no data leaving your server, and first-token latency measured in milliseconds rather than seconds. This guide covers everything from driver installation to pointing NemoClaw at your own GPU.
GPU inference is optional. A $10/month Hostinger VPS with Claude or OpenAI as the provider works great and costs less per month than a gaming GPU. Come back to this guide when you have hardware ready, or when your API bill gets large enough that local inference makes financial sense.
Why Local Inference?
| Reason | Details |
|---|---|
| Privacy | Nothing leaves your machine — no prompts, no responses sent to a third-party API |
| Cost | GPU electricity cost is ~$0.02–0.05/hour; Opus API can cost $1+/hour under heavy use |
| Latency | Local 7B models return first token in <200ms; API models vary 500ms–3s |
| Rate limits | No provider rate limits — run as many requests as your GPU can handle |
| Offline use | Works without internet (once models are downloaded) |
The trade-off: local models under 14B parameters are noticeably less capable than Claude Sonnet or GPT-4.1 for complex reasoning. For simple automation tasks (heartbeats, triage, summaries) they're excellent. For complex multi-step reasoning, route to a cloud model as a fallback.
Step 1 — Check GPU Compatibility
Local inference with NemoClaw requires an NVIDIA GPU with CUDA compute capability 7.0 or higher (Volta architecture or newer). Here's the VRAM requirement by use case:
| VRAM | Suitable GPU examples | Models that fit | Quality tier |
|---|---|---|---|
| 8 GB | RTX 3070, RTX 4060 Ti | Llama 3.2 3B (full), Qwen 2.5 7B (4-bit quant) | Good for simple tasks |
| 16 GB | RTX 4080, RTX 4080 Super, A4000 | Qwen 2.5 14B (4-bit), Nemotron-Mini (full) | Solid assistant quality |
| 24 GB | RTX 3090, RTX 4090, A5000 | Qwen 2.5 32B (4-bit), Nemotron-4-Mini (full), Llama 3.3 70B (2-bit) | Near-API quality |
| 40–80 GB | A100, H100, L40S | Llama 3.3 70B (full), Nemotron-4-340B (multi-GPU only) | Approaches Claude Sonnet |
| Multi-GPU | 2× H100, 4× A100 | Nemotron-4-340B (full precision), frontier models | Top tier |
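The fit estimates above follow a back-of-envelope rule: weight memory plus roughly 20% overhead for KV cache and activations. A sketch of that arithmetic (the 1.2 overhead factor is an assumption; real usage grows with context length):

```shell
# Rule of thumb: VRAM (GB) ~= params (billions) x bytes per weight x 1.2.
# Bytes per weight: 2 (FP16), 1 (8-bit), 0.5 (4-bit quantisation).
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b * 1.2 }'
}

estimate_vram_gb 14 0.5   # Qwen 2.5 14B at 4-bit -> prints 8.4
estimate_vram_gb 70 2     # Llama 3.3 70B at FP16 -> prints 168.0
```

The 14B 4-bit figure lands comfortably under 16 GB, matching the table; the FP16 70B figure shows why that model needs data-centre cards.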
Check your GPU's compute capability:
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
# Example output:
# NVIDIA GeForce RTX 4090, 8.9, 24564 MiB
Compute capability 8.9 (RTX 4090) is well above the 7.0 minimum. If nvidia-smi isn't found, the driver isn't installed yet; continue with Step 2.
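If you provision machines with scripts, you can gate on the 7.0 minimum directly. A sketch (`meets_min_cap` is a hypothetical helper name):

```shell
# Return success only when the reported compute capability is >= 7.0.
meets_min_cap() {
  awk -v c="$1" 'BEGIN { exit !(c >= 7.0) }'
}

cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -1)
if meets_min_cap "${cap:-0}"; then
  echo "OK: compute capability $cap meets the 7.0 minimum"
else
  echo "unsupported or no GPU detected (compute capability: ${cap:-none})" >&2
fi
```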
Step 2 — Install NVIDIA Drivers
On Ubuntu 22.04 or 24.04 (the most common VPS and workstation OS for this):
# Add the NVIDIA driver PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
# Install the recommended driver (545 or 550 as of 2026)
sudo apt install nvidia-driver-550
# Or let Ubuntu pick the right driver:
sudo ubuntu-drivers autoinstall
# Reboot required after driver install
sudo reboot
After reboot, verify:
nvidia-smi
# Should show driver version, CUDA version, and your GPU name
# Example:
# Driver Version: 550.54.15 CUDA Version: 12.4
The CUDA version shown by nvidia-smi is the maximum supported CUDA runtime version — not what's installed. Install CUDA Toolkit 12.4 explicitly in Step 3. Don't assume it's there just because nvidia-smi shows a CUDA version.
Step 3 — Install CUDA Toolkit
Download from developer.nvidia.com/cuda-downloads. Select Linux → x86_64 → Ubuntu → 22.04 → deb (network):
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-4
# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify
nvcc --version
# Should show: Cuda compilation tools, release 12.4
Step 4 — Install Ollama with GPU Support
Ollama is the easiest way to run open models with GPU acceleration. It auto-detects CUDA at install time:
curl -fsSL https://ollama.com/install.sh | sh
# Verify the install: run a small model as a quick test
ollama run llama3.2:3b "Hello"
Confirm GPU usage directly:
# In one terminal, start a model
ollama run qwen2.5:7b "Describe yourself"
# In another terminal, watch GPU utilisation
watch -n 0.5 nvidia-smi
# GPU-Util should jump to 80-100% during inference
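Ollama can also report where a loaded model's weights ended up: `ollama ps` lists running models with a PROCESSOR column, and a healthy setup shows the model fully on the GPU:

```shell
# With a model loaded, ask Ollama where it placed the weights.
# A correct GPU setup reports "100% GPU" in the PROCESSOR column;
# "CPU" or a split like "40%/60% CPU/GPU" means spillover.
ollama ps 2>/dev/null || echo "ollama not running"
```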
Recommended Models by VRAM
# 8 GB VRAM — fast, lightweight
ollama pull llama3.2:3b
ollama pull qwen2.5:7b # needs 4-bit quant at 8 GB
# 16 GB VRAM — good quality
ollama pull qwen2.5:14b
ollama pull mistral-small:22b # 4-bit fits 16 GB
# 24 GB VRAM — near-API quality
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b # caution: 4-bit needs ~45 GB and spills to CPU at 24 GB
Step 5 — Nemotron via NVIDIA NIM
For NVIDIA's Nemotron models specifically, NVIDIA Inference Microservices (NIM) provides optimised containers with better performance than raw Ollama. NIM requires a free NVIDIA developer account:
# Authenticate with NGC (NVIDIA GPU Cloud)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key from build.nvidia.com>
# Pull and run Nemotron Mini (4.1B — fits in 8 GB VRAM)
docker run --gpus all -p 8000:8000 \
  nvcr.io/nvidia/nemotron-mini-4b-instruct:latest
# Or Nemotron-4-Mini-Instruct (8B — needs 16 GB)
docker run --gpus all -p 8000:8000 \
  nvcr.io/nvidia/nemotron-4-mini-instruct:latest
NIM serves an OpenAI-compatible API on port 8000. This means you can point NemoClaw at it using the OpenAI provider type with a local URL:
# Add NIM as a provider in OpenShell
openShell provider add \
  --name nemotron-local \
  --type openai-compatible \
  --base-url http://localhost:8000/v1 \
  --model nemotron-mini-4b-instruct
# Route inference through this provider
openShell inference route set --provider nemotron-local
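Because the NIM endpoint speaks the standard OpenAI chat-completions shape, you can smoke-test it with plain curl before touching any NemoClaw config. A sketch (the model name must match what your container actually serves):

```shell
# Standard OpenAI-style chat-completions request against the local NIM port.
payload='{"model": "nemotron-mini-4b-instruct", "messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'

# A JSON response containing a "choices" array confirms the endpoint is up.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" \
  || echo "endpoint not reachable: is the NIM container running?" >&2
```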
Use NIM for Nemotron models — it's optimised specifically for them and delivers higher token throughput. Use Ollama for everything else (Llama, Qwen, Mistral) — it has a larger model library and simpler management. Both serve OpenAI-compatible APIs and work identically from NemoClaw's perspective.
Step 6 — Configure NemoClaw for Local Inference
Once Ollama or NIM is running on the host, configure NemoClaw to use it. Remember: this config lives inside the sandbox (after claw connect <sandbox-name>):
# Step 1: Connect to the sandbox
claw connect nemoclaw
# Step 2: Add the local policy rule so the sandbox can reach localhost
# (exit sandbox first, add rule, reload, re-enter)
exit
cat >> ~/.openShell/policies/includes/local-inference.yaml << 'EOF'
allow:
  - host: "localhost"
    ports: [11434]  # Ollama default port
    comment: "Local Ollama inference"
  - host: "127.0.0.1"
    ports: [11434, 8000]  # Ollama + NIM
    comment: "Local inference endpoints"
EOF
openShell policy reload
# Step 3: Re-enter sandbox and configure OpenClaw
claw connect nemoclaw
# Step 4: Add Ollama as a provider inside the sandbox config
openclaw config set agents.defaults.model.primary "ollama/qwen2.5:14b"
openclaw config set agents.defaults.models '{"ollama/qwen2.5:14b":{"alias":"Local Qwen 14B"},"anthropic/claude-haiku-4-5":{"alias":"Haiku (cloud fallback)"}}'
# Step 5: Restart the gateway
openclaw gateway restart
Test it:
# Inside the sandbox
openclaw run "What model are you running on?"
# Should respond mentioning qwen or the local model name
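If that fails, it helps to hit Ollama's native API directly, which tells you whether the problem is Ollama itself or the NemoClaw-to-Ollama wiring. A sketch (the model name is assumed to be one you've pulled):

```shell
# Ollama's native generate endpoint; "stream": false returns one JSON object.
payload='{"model": "qwen2.5:14b", "prompt": "Reply with one word: ready", "stream": false}'

curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "Ollama not reachable on 11434: check that the service is running" >&2
```

If this responds but NemoClaw doesn't, the problem is almost certainly the OpenShell policy rule rather than Ollama.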
Performance Expectations
| GPU | Model | Tokens/sec (output) | Notes |
|---|---|---|---|
| RTX 4090 (24 GB) | Qwen 2.5 14B (full) | ~80–100 tok/s | Fast — chat feels instant |
| RTX 4090 (24 GB) | Qwen 2.5 32B (4-bit) | ~40–50 tok/s | Good — slight pause on long outputs |
| RTX 4080 (16 GB) | Qwen 2.5 14B (4-bit) | ~60–75 tok/s | Good — nearly instant |
| RTX 3080 (10 GB) | Llama 3.2 3B (full) | ~120 tok/s | Very fast but limited capability |
| A100 (80 GB) | Llama 3.3 70B (full) | ~50–65 tok/s | Near-API quality at full speed |
| CPU only (no GPU) | Llama 3.2 3B | ~5–15 tok/s | Usable for background tasks only |
Numbers are approximate and vary by system RAM bandwidth, power mode, and temperature throttling.
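To measure your own numbers rather than trust the table, `ollama run --verbose` prints timing stats (prompt eval rate, eval rate) after each response. A small helper can pull out the output tokens/sec figure (`parse_eval_rate` is a hypothetical name):

```shell
# Extract the output tokens/sec from ollama's --verbose timing lines.
# Anchoring on "^eval rate" skips the separate "prompt eval rate" line.
parse_eval_rate() {
  grep '^eval rate' | awk '{ print $3 }'
}

ollama run qwen2.5:14b "Write a haiku about GPUs." --verbose 2>&1 | parse_eval_rate
```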
Troubleshooting
| Problem | Solution |
|---|---|
| CUDA not found / nvcc: not found | CUDA Toolkit not installed or not on PATH. Re-check Step 3 and verify nvcc --version after sourcing .bashrc |
| Ollama shows CPU inference (no GPU) | Check the Ollama server logs (journalctl -u ollama on systemd installs) for CUDA library loading at startup. If CUDA never loads, reinstall Ollama after CUDA is confirmed working |
| Out of memory (OOM) error | Model doesn't fit in VRAM. Pull a smaller model or use a quantized version (e.g. qwen2.5:14b-q4_K_M) |
| NemoClaw can't reach Ollama | Missing policy rule. Add localhost:11434 to your OpenShell policy and reload |
| Driver/CUDA version conflict | Run sudo apt install --reinstall nvidia-driver-550 cuda-toolkit-12-4 and reboot |
| nvidia-smi works but inference uses CPU | Check that CUDA libraries are on LD_LIBRARY_PATH. Run: ldconfig -p \| grep libcuda — should show paths |
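Many of the rows above reduce to "is the component present and listening". A one-shot report script, as a sketch (`gpu_stack_report` is a hypothetical name; the port checks use bash's /dev/tcp):

```shell
# Health report for the local inference stack: driver, toolkit,
# Ollama binary, and the two local API ports.
gpu_stack_report() {
  command -v nvidia-smi >/dev/null 2>&1 && echo "driver: ok"  || echo "driver: MISSING"
  command -v nvcc       >/dev/null 2>&1 && echo "toolkit: ok" || echo "toolkit: MISSING"
  command -v ollama     >/dev/null 2>&1 && echo "ollama: ok"  || echo "ollama: MISSING"
  for port in 11434 8000; do
    (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null \
      && echo "port $port: listening" || echo "port $port: closed"
  done
}
gpu_stack_report
```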
← Back to NemoClaw hub · See also: Switching Model Providers · Cost Optimisation Guide