Last updated: 2026-04-06

Local GPU Inference Setup — CUDA, Nemotron & VRAM Requirements

Most NemoClaw users connect to Claude or OpenAI. But if you have an NVIDIA GPU — whether in a local workstation or a GPU cloud instance — you can run inference entirely on your own hardware. No API costs, no data leaving your server, and latency measured in milliseconds rather than seconds. This guide covers everything from driver installation to pointing NemoClaw at your GPU.

You don't need a GPU to run NemoClaw

GPU inference is optional. A $10/month Hostinger VPS with Claude or OpenAI as the provider works great and costs less per month than a gaming GPU. Come back to this guide when you have hardware ready, or when your API bill gets large enough that local inference makes financial sense.

Why Local Inference?

Reason | Details
Privacy | Nothing leaves your machine — no prompts or responses sent to a third-party API
Cost | GPU electricity cost is ~$0.02–0.05/hour; Opus API can cost $1+/hour under heavy use
Latency | Local 7B models return the first token in <200 ms; API models vary from 500 ms to 3 s
Rate limits | No provider rate limits — run as many requests as your GPU can handle
Offline use | Works without internet (once models are downloaded)

The trade-off: local models under 14B parameters are noticeably less capable than Claude Sonnet or GPT-4.1 for complex reasoning. For simple automation tasks (heartbeats, triage, summaries) they're excellent. For complex multi-step reasoning, route to a cloud model as a fallback.

Step 1 — Check GPU Compatibility

Local inference with NemoClaw requires an NVIDIA GPU with CUDA compute capability 7.0 or higher (Volta architecture or newer). Here's the VRAM requirement by use case:

VRAM | Suitable GPU examples | Models that fit | Quality tier
8 GB | RTX 3070, RTX 4060 Ti | Llama 3.2 3B (full), Qwen 2.5 7B (4-bit quant) | Good for simple tasks
16 GB | RTX 4080, RTX 4080 Super, A4000 | Qwen 2.5 14B (4-bit), Nemotron-Mini (full) | Solid assistant quality
24 GB | RTX 3090, RTX 4090, A5000 | Qwen 2.5 32B (4-bit), Nemotron-4-Mini (full), Llama 3.3 70B (2-bit) | Near-API quality
40–80 GB | A100, H100, L40S | Llama 3.3 70B (4-bit or 8-bit), Nemotron-4-340B (multi-GPU only) | Approaches Claude Sonnet
Multi-GPU | 2× H100, 4× A100, and up | Llama 3.3 70B (full), Nemotron-4-340B (8+ GPUs) | Top tier
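
If you're unsure which row you fall into, a rough rule of thumb helps: model weights take about (parameters in billions) × (bytes per parameter) of VRAM, plus a couple of GB for the KV cache and runtime overhead. The figures below are approximations, not exact requirements:

# Rough VRAM needed ≈ params (billions) × bytes per param + ~2 GB overhead
# FP16 (full) = 2 bytes/param, 8-bit = 1, 4-bit = 0.5
#   7B at 4-bit:   7 × 0.5 + 2 ≈  6 GB   → fits an 8 GB card
#  14B at 4-bit:  14 × 0.5 + 2 ≈  9 GB   → fits a 16 GB card
#  32B at 4-bit:  32 × 0.5 + 2 ≈ 18 GB   → fits a 24 GB card
#  70B at 4-bit:  70 × 0.5 + 2 ≈ 37 GB   → needs 40 GB or more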

Check your GPU's compute capability:

nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
# Example output:
# NVIDIA GeForce RTX 4090, 8.9, 24564 MiB

Compute capability 8.9 (RTX 4090) is well above the 7.0 minimum. If nvidia-smi isn't found, the driver isn't installed yet; proceed to Step 2.

Step 2 — Install NVIDIA Drivers

On Ubuntu 22.04 or 24.04 (the most common VPS and workstation OS for this):

# Add the NVIDIA driver PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update

# Install the recommended driver (545 or 550 as of 2026)
sudo apt install nvidia-driver-550

# Or let Ubuntu pick the right driver:
sudo ubuntu-drivers autoinstall

# Reboot required after driver install
sudo reboot

After reboot, verify:

nvidia-smi
# Should show driver version, CUDA version, and your GPU name
# Example:
# Driver Version: 550.54.15   CUDA Version: 12.4

Driver/CUDA mismatch is the #1 cause of inference failures

The CUDA version shown by nvidia-smi is the maximum supported CUDA runtime version — not what's installed. Install CUDA Toolkit 12.4 explicitly in Step 3. Don't assume it's there just because nvidia-smi shows a CUDA version.
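
A quick way to see both sides of a potential mismatch is to compare the driver's reported maximum against the toolkit actually on disk. A minimal check (nvcc will be absent until Step 3 is done):

# Maximum CUDA version the driver supports
nvidia-smi | grep "CUDA Version"

# CUDA toolkit actually installed
nvcc --version 2>/dev/null | grep release || echo "CUDA Toolkit not installed yet"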

Step 3 — Install CUDA Toolkit

Download from developer.nvidia.com/cuda-downloads. Select Linux → x86_64 → Ubuntu → 22.04 → deb (network):

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-4

# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify
nvcc --version
# Should show: Cuda compilation tools, release 12.4
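
If you want an end-to-end confirmation that the toolkit can actually reach the GPU (not just that nvcc is on PATH), compiling a tiny CUDA program is the most direct test. A small sketch; the file path is arbitrary:

# Compile and run a one-file device check
cat > /tmp/cuda_check.cu << 'EOF'
#include <cstdio>
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);   // runtime API call; reports 0 if no usable GPU/driver
    printf("CUDA devices visible: %d\n", count);
    return count > 0 ? 0 : 1;
}
EOF
nvcc /tmp/cuda_check.cu -o /tmp/cuda_check && /tmp/cuda_check
# Expected output: "CUDA devices visible: 1" (or however many GPUs you have)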

Step 4 — Install Ollama with GPU Support

Ollama is the easiest way to run open models with GPU acceleration. It detects CUDA automatically, so no extra configuration is needed:

curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU is detected: run a small model, then check where it loaded
ollama run llama3.2:3b "Hello"

# While the model is still loaded, the PROCESSOR column should say GPU (not CPU)
ollama ps

Confirm GPU usage directly:

# In one terminal, start a model
ollama run qwen2.5:7b "Describe yourself"

# In another terminal, watch GPU utilisation
watch -n 0.5 nvidia-smi
# GPU-Util should jump to 80-100% during inference

Recommended Models by VRAM

# 8 GB VRAM — fast, lightweight
ollama pull llama3.2:3b
ollama pull qwen2.5:7b          # default tag is 4-bit quant, fits in 8 GB

# 16 GB VRAM — good quality
ollama pull qwen2.5:14b
ollama pull mistral-small:22b   # 4-bit fits 16 GB

# 24 GB VRAM — near-API quality
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b         # ~47 GB at 4-bit; partially offloads to CPU on a 24 GB card
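
Ollama also exposes an OpenAI-compatible API on port 11434, which is the endpoint NemoClaw will use in Step 6. A quick sanity check that it responds (this assumes qwen2.5:14b has been pulled; substitute whichever model you chose):

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:14b",
        "messages": [{"role": "user", "content": "Reply with one word: ready"}],
        "max_tokens": 10
      }'
# A JSON response with a "choices" array means the endpoint is working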

Step 5 — Nemotron via NVIDIA NIM

For NVIDIA's Nemotron models specifically, NVIDIA Inference Microservices (NIM) provides optimised containers with better performance than raw Ollama. NIM requires a free NVIDIA developer account:

# Authenticate with NGC (NVIDIA GPU Cloud)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key from build.nvidia.com>

# Pull and run Nemotron Mini (4.1B — fits in 8 GB VRAM)
docker run --gpus all -p 8000:8000 \
  nvcr.io/nvidia/nemotron-mini-4b-instruct:latest

# Or Nemotron-4-Mini-Instruct (8B — needs 16 GB)
docker run --gpus all -p 8000:8000 \
  nvcr.io/nvidia/nemotron-4-mini-instruct:latest

NIM serves an OpenAI-compatible API on port 8000. This means you can point NemoClaw at it using the OpenAI provider type with a local URL:
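
Before adding it as a provider, it's worth confirming the container is actually serving. These are standard OpenAI-compatible endpoints; the model name in the request should match whatever /v1/models reports:

# List the models the NIM container exposes
curl -s http://localhost:8000/v1/models

# Send a test completion
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nemotron-mini-4b-instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'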

# Add NIM as a provider in OpenShell
openShell provider add \
  --name nemotron-local \
  --type openai-compatible \
  --base-url http://localhost:8000/v1 \
  --model nemotron-mini-4b-instruct

# Route inference through this provider
openShell inference route set --provider nemotron-local

NIM vs Ollama — when to use which

Use NIM for Nemotron models — it's optimised specifically for them and produces better token rates. Use Ollama for everything else (Llama, Qwen, Mistral) — it has a larger model library and simpler management. Both serve OpenAI-compatible APIs and work identically from NemoClaw's perspective.

Step 6 — Configure NemoClaw for Local Inference

Once Ollama or NIM is running on the host, configure NemoClaw to use it. Remember: this config lives inside the sandbox (after claw connect <sandbox-name>):

# Step 1: Connect to the sandbox
claw connect nemoclaw

# Step 2: Add the local policy rule so the sandbox can reach localhost
# (exit sandbox first, add rule, reload, re-enter)
exit
cat >> ~/.openShell/policies/includes/local-inference.yaml << 'EOF'
allow:
  - host: "localhost"
    ports: [11434]   # Ollama default port
    comment: "Local Ollama inference"
  - host: "127.0.0.1"
    ports: [11434, 8000]  # Ollama + NIM
    comment: "Local inference endpoints"
EOF
openShell policy reload

# Step 3: Re-enter sandbox and configure OpenClaw
claw connect nemoclaw

# Step 4: Add Ollama as a provider inside the sandbox config
openclaw config set agents.defaults.model.primary "ollama/qwen2.5:14b"
openclaw config set agents.defaults.models '{"ollama/qwen2.5:14b":{"alias":"Local Qwen 14B"},"anthropic/claude-haiku-4-5":{"alias":"Haiku (cloud fallback)"}}'

# Step 5: Restart the gateway
openclaw gateway restart

Test it:

# Inside the sandbox
openclaw run "What model are you running on?"
# Should respond mentioning qwen or the local model name
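
If the response doesn't mention the local model, first confirm the sandbox can reach Ollama at all; the policy rule added above is the usual culprit. A quick check from inside the sandbox, using Ollama's model-listing endpoint:

# Inside the sandbox: list the models Ollama exposes
curl -s http://localhost:11434/api/tags
# A connection error here means the OpenShell policy rule isn't taking effect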

Performance Expectations

GPU | Model | Tokens/sec (output) | Notes
RTX 4090 (24 GB) | Qwen 2.5 14B (4-bit) | ~80–100 tok/s | Fast — chat feels instant
RTX 4090 (24 GB) | Qwen 2.5 32B (4-bit) | ~40–50 tok/s | Good — slight pause on long outputs
RTX 4080 (16 GB) | Qwen 2.5 14B (4-bit) | ~60–75 tok/s | Good — nearly instant
RTX 3080 (10 GB) | Llama 3.2 3B (full) | ~120 tok/s | Very fast but limited capability
A100 (80 GB) | Llama 3.3 70B (4-bit) | ~50–65 tok/s | Near-API quality at full speed
CPU only (no GPU) | Llama 3.2 3B | ~5–15 tok/s | Usable for background tasks only

Numbers are approximate and vary with memory bandwidth, power limits, and thermal throttling.
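
To measure your own throughput rather than relying on the table, Ollama can print timing statistics for a single run; the eval rate line is the output tokens-per-second figure used above:

# Run once with timing stats
ollama run qwen2.5:14b --verbose "Write three sentences about GPUs."
# Look for the final stats block:
#   prompt eval rate:  ... tokens/s   (input processing)
#   eval rate:         ... tokens/s   (output generation; the number quoted above)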

Troubleshooting

Problem | Solution
CUDA not found / nvcc: not found | CUDA Toolkit not installed or not on PATH. Re-check Step 3 and verify nvcc --version after sourcing .bashrc
Ollama shows CPU inference (no GPU) | Check the Ollama server logs (journalctl -u ollama) for GPU detection at startup. If no GPU is detected there, reinstall Ollama after CUDA is confirmed working
Out of memory (OOM) error | Model doesn't fit in VRAM. Pull a smaller model or a quantized version (e.g. qwen2.5:14b-q4_K_M)
NemoClaw can't reach Ollama | Missing policy rule. Add localhost:11434 to your OpenShell policy and reload
Driver/CUDA version conflict | Run sudo apt install --reinstall nvidia-driver-550 cuda-toolkit-12-4 and reboot
nvidia-smi works but inference uses CPU | Check that CUDA libraries are on LD_LIBRARY_PATH. Run ldconfig -p | grep libcuda — it should list library paths
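
If none of the rows above resolve the issue, gather a quick snapshot of the whole stack before digging deeper. Each command here was introduced earlier in this guide; the two curl checks assume the default Ollama and NIM ports:

nvidia-smi --query-gpu=name,driver_version,memory.used,memory.total --format=csv
nvcc --version | grep release
ollama ps        # which models are loaded, and whether on CPU or GPU
curl -s -o /dev/null -w "Ollama: HTTP %{http_code}\n" http://localhost:11434/api/tags
curl -s -o /dev/null -w "NIM:    HTTP %{http_code}\n" http://localhost:8000/v1/models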

← Back to NemoClaw hub · See also: Switching Model Providers · Cost Optimisation Guide