Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
AI agents that browse the web, query databases, control smart home devices, and chain tools together to solve problems - all running on your own GPU. That’s the promise. The reality depends on whether your model can reliably call functions without hallucinating parameters or losing track of multi-step plans.
Function calling is the hardest task for small models. A chat model can get away with an imprecise answer. A coding model can produce code that’s mostly right. But a model calling a function with the wrong argument crashes the workflow. There’s no partial credit.
We tested local models on tool use benchmarks to find which ones actually work as agents at each VRAM tier.
The Benchmarks That Matter
- BFCL V4 - Berkeley Function Calling Leaderboard. Tests single-turn and multi-turn function calls, serial and parallel execution, and multi-language support. The standard benchmark for tool use reliability
- TAU2-bench - Real-world tool-augmented tasks. Tests whether models can complete end-to-end workflows using tools
- ScreenSpot Pro - GUI understanding and interaction. Can the model navigate a desktop application?
- OSWorld-Verified - Desktop automation tasks. The hardest agentic benchmark - complete real tasks on a real operating system
The Models
| Model | Params | BFCL V4 | TAU2-bench | VRAM (Q4) | Tool Call Format |
|---|---|---|---|---|---|
| Qwen 3.5 9B | 9B | 66.1 | 79.1 | ~7 GB | Native Qwen tool format |
| Qwen 3.5 4B | 4B | - | - | 3.4 GB | Native Qwen tool format |
| Qwen 3 8B | 8B | ~55 | - | 6.5 GB | Native Qwen tool format |
| Qwen 3 14B | 14B | ~62 | - | 10.7 GB | Native Qwen tool format |
| Qwen 3 32B | 32B | ~68 | - | 22.2 GB | Native Qwen tool format |
| Qwen 3 30B MoE | 30B (3B active) | - | - | ~18 GB | Native Qwen tool format |
| EXAONE 4.0 32B | 32B | - | - | ~22 GB | OpenAI-compatible |
| Llama 3.1 8B | 8B | ~48 | - | 6.2 GB | Llama tool format |
| Mistral Small 22B | 22B | - | - | ~14 GB | Mistral tool format |
For context, GPT-5 mini scores 55.5 on BFCL V4. Qwen 3.5 9B’s 66.1 beats it by 19%. On TAU2-bench (79.1), it outperforms models 3-13x its size. This is the strongest small model result for agent tasks.
What Makes a Good Agent Model?
Beyond benchmarks, practical agent reliability depends on:
- Structured output - does it format JSON function calls correctly, every time?
- Stop token discipline - does it stop after the function call, or keep generating text that breaks parsing?
- Multi-step planning - can it chain 3+ tool calls without losing track of the goal?
- Error recovery - when a tool returns an error, does it retry with corrected parameters or spiral?
- Temperature sensitivity - at temperature 0, is the output deterministic and parseable?
The Qwen models excel at the first three: they use a well-defined tool-call template that frameworks parse easily. Llama models are more variable - they work, but need more careful prompt engineering for reliable structured output.
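The "no partial credit" point is why frameworks validate every call before executing it. A minimal sketch of that validation layer, assuming a generic JSON tool-call format (the exact template varies by model family, and both tool names here are hypothetical):

```python
import json

# Hypothetical tool schema: name -> set of required parameter names.
TOOLS = {
    "get_weather": {"city"},
    "send_notification": {"message"},
}

def parse_tool_call(raw: str):
    """Validate a model's raw tool-call output before executing anything.

    Returns (name, arguments) on success, or None if the output is
    malformed - the caller can then re-prompt the model instead of
    crashing the workflow.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(call, dict):
        return None  # JSON, but not a call object
    name = call.get("name")
    args = call.get("arguments")
    if name not in TOOLS or not isinstance(args, dict):
        return None  # unknown tool or missing argument object
    if not TOOLS[name] <= args.keys():
        return None  # required parameters missing (or hallucinated away)
    return name, args

# A well-formed call passes; a call with hallucinated parameters does not.
print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
print(parse_tool_call('{"name": "get_weather", "arguments": {"zip": 90210}}'))
```

Returning `None` instead of raising is deliberate: it gives the agent loop a chance to re-prompt, which is exactly the error-recovery behavior point 4 tests.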
8GB VRAM {#8gb}
GPUs: RTX 4060, RTX 3060 8GB, RTX 3070
Candidates
| Model | BFCL V4 | TAU2-bench | VRAM | Reliability |
|---|---|---|---|---|
| Qwen 3.5 9B | 66.1 | 79.1 | ~7 GB | High |
| Qwen 3 8B | ~55 | - | 6.5 GB | Good |
| Qwen 3.5 4B | - | - | 3.4 GB | Moderate |
| Llama 3.1 8B | ~48 | - | 6.2 GB | Variable |
Winner: Qwen 3.5 9B
The Qwen 3.5 9B is the clear pick. At 66.1 BFCL V4, it outperforms GPT-5 mini on function calling while fitting in 7GB. The TAU2-bench score (79.1) confirms it can handle real multi-step workflows, not just synthetic benchmarks.
Both “thinking” and “non-thinking” modes are supported. Use thinking mode when the agent needs to reason about which tool to call. Use non-thinking mode for speed when the tool choice is obvious.
ollama pull qwen3.5:9b
Budget option: Qwen 3.5 4B at 3.4GB. Function calling works, but reliability drops on complex chains. Good enough for simple automations - "check the weather and send me a notification" - but not for multi-step workflows.
The honest take: 8GB agent models can handle 2-3 step tool chains reliably. Beyond that, they start losing context and making mistakes. For simple automations (smart home control, single API calls, basic data retrieval), they work well. For complex multi-step agents (research tasks, multi-tool workflows), you need more VRAM.
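A single-API-call automation like the weather example needs nothing more than Ollama's `/api/chat` endpoint with its OpenAI-style `tools` parameter. A minimal sketch - the `get_weather` tool is hypothetical, and the request-building part is separated out so it works without a running server:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an Ollama /api/chat payload advertising one tool.

    `get_weather` is a made-up example tool; swap in your own schema.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

if __name__ == "__main__":
    payload = build_chat_request("qwen3.5:9b", "What's the weather in Oslo?")
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # If the model chose to call the tool, the call arrives here.
    print(reply["message"].get("tool_calls"))
```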
For all use cases at this level, see our 8GB VRAM complete guide.
12GB VRAM {#12gb}
GPUs: RTX 3060 12GB, RTX 4070
Candidates
| Model | BFCL V4 | VRAM | Speed |
|---|---|---|---|
| Qwen 3 14B | ~62 | 10.7 GB | ~50 tok/s |
| Qwen 3.5 9B | 66.1 | ~7 GB | ~55 tok/s |
| Phi-4 14B | - | ~10 GB | 25-35 tok/s |
Winner: Qwen 3 14B
At 12GB, you can run the 14B model with decent context headroom. The Qwen 3 14B has native tool calling support and more consistent structured output than the 8B. The extra parameters translate directly to better multi-step planning - it can maintain a 4-5 step tool chain without losing the thread.
Alternative: Qwen 3.5 9B with 5GB of context headroom. If your agent tasks involve large tool responses (database queries, web page contents), the extra context buffer matters more than the 14B’s marginal quality improvement.
For all use cases at this level, see our 12GB VRAM complete guide.
16GB VRAM {#16gb}
GPUs: RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, Intel Arc A770
Candidates
| Model | BFCL V4 | VRAM | Best For |
|---|---|---|---|
| Qwen 3 14B (Q5) | ~62 | ~12 GB | Higher quant, better reliability |
| GPT-OSS 20B | - | ~14 GB | OpenAI-format tool calls |
| Mistral Small 22B | - | ~14 GB | Multi-tool chains |
| Qwen 3.5 9B | 66.1 | ~7 GB | Speed + room for tools |
Winner: Qwen 3 14B at Q5_K_M
The quality bump from Q4 to Q5 quantization matters more for agents than for chat. At Q5_K_M (~12GB), function call formatting becomes more reliable - fewer malformed JSON outputs, better parameter accuracy. The 4GB headroom on a 16GB card gives you room for tool response context.
OpenAI-compatible: GPT-OSS 20B at ~14GB. If your agent framework expects OpenAI-format function calls (most do), GPT-OSS speaks that format natively. The speed advantage (~140 tok/s) means faster tool chain execution.
Multi-tool specialist: Mistral Small 22B at ~14GB. Mistral models have strong parallel function calling - they can decide to call multiple tools simultaneously when the calls are independent. If your workflow involves “fetch weather AND check calendar AND query database” in parallel, Mistral handles this pattern well.
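On the framework side, a parallel turn like that executes concurrently. A sketch with stub tools standing in for real integrations (all three tool names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub tools standing in for real integrations.
def fetch_weather(city):   return f"sunny in {city}"
def check_calendar(day):   return f"2 events on {day}"
def query_database(sql):   return f"3 rows for {sql!r}"

REGISTRY = {
    "fetch_weather": fetch_weather,
    "check_calendar": check_calendar,
    "query_database": query_database,
}

def run_parallel(tool_calls):
    """Execute independent tool calls concurrently, as a framework does
    when the model emits several calls in a single turn."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(REGISTRY[c["name"]], **c["arguments"])
            for c in tool_calls
        ]
        # Results come back in the same order the model emitted the calls.
        return [f.result() for f in futures]

# The kind of turn described above: three independent calls at once.
results = run_parallel([
    {"name": "fetch_weather", "arguments": {"city": "Oslo"}},
    {"name": "check_calendar", "arguments": {"day": "Monday"}},
    {"name": "query_database", "arguments": {"sql": "SELECT 1"}},
])
print(results)
```

Note the caveat from the "struggles with" list later on: parallel calls only pay off when the model correctly judges that the calls are independent.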
For all use cases at this level, see our 16GB VRAM complete guide.
24GB VRAM {#24gb}
GPUs: RTX 3090, RTX 4090
This is where local agents get serious. The 32B models can maintain complex multi-step workflows and handle long tool response chains.
Candidates
| Model | BFCL V4 | VRAM | Speed (RTX 4090) |
|---|---|---|---|
| Qwen 3 32B | ~68 | 22.2 GB | 34 tok/s |
| Qwen 3 30B MoE | - | ~18 GB | 196 tok/s |
| EXAONE 4.0 32B | - | ~22 GB | ~30 tok/s |
| Gemma 3 27B QAT | - | ~14 GB | ~40 tok/s |
Winner: Qwen 3 30B MoE
For agents, speed matters as much as quality. Each tool call is a round trip - the model generates the call, the tool executes, the response comes back, the model processes it. A 5-step workflow means 5+ inference passes. At 196 tok/s on the RTX 4090, the Qwen 3 30B MoE completes these chains in seconds rather than minutes.
The MoE architecture activates only 3B parameters per token, so it’s fast, but the full 30B parameter space gives it broad tool-calling knowledge. At ~18GB, you have 6GB free for context - enough for complex tool responses.
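The round trip described above is easiest to see as a loop. A minimal sketch with a scripted stand-in for the model (real frameworks add call validation, retries, and token budgets):

```python
def agent_loop(call_model, tools, user_goal, max_steps=10):
    """Minimal agent loop: each tool call costs one model round trip.

    `call_model` takes the history and returns either
    {"tool": name, "args": {...}} or {"answer": text}.
    """
    history = [("user", user_goal)]
    for _ in range(max_steps):
        action = call_model(history)
        if "answer" in action:
            return action["answer"]  # chain complete
        result = tools[action["tool"]](**action["args"])
        history.append(("tool", result))  # feed the result back
    return None  # chain too long for this model / step budget

# Scripted stand-in for the model: two tool calls, then a final answer.
script = iter([
    {"tool": "get_time", "args": {}},
    {"tool": "set_alarm", "args": {"at": "07:00"}},
    {"answer": "Alarm set for 07:00."},
])
tools = {"get_time": lambda: "22:15", "set_alarm": lambda at: f"alarm@{at}"}
print(agent_loop(lambda h: next(script), tools, "Wake me at 7"))
```

Each iteration is a full inference pass over the growing history, which is why a fast MoE model shortens a 5-step workflow so dramatically.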
Quality ceiling: Qwen 3 32B dense at ~68 BFCL V4. When reliability matters more than speed (financial transactions, critical automations), the dense model’s higher accuracy is worth the 6x speed penalty.
Dark horse: EXAONE 4.0 32B. LG’s model has strong reasoning capabilities (92.3 MMLU-Redux) which translates to better planning in multi-step agent tasks. Less community testing for agents specifically, but worth trying if Qwen doesn’t fit your workflow.
For all use cases at this level, see our 24GB VRAM complete guide.
32GB VRAM {#32gb}
GPUs: RTX 5090
Candidates
| Model | BFCL V4 | VRAM | Notes |
|---|---|---|---|
| Qwen 3 32B (Q6) | ~68 | ~28 GB | Near-lossless agent quality |
| Qwen 3 30B MoE (Q8) | - | ~28 GB | Maximum speed + quality |
| Llama 3.3 70B (Q3) | - | ~32 GB | Tight fit; little context headroom, and heavy Q3 quantization can hurt tool-call reliability |
Winner: Qwen 3 30B MoE at Q8_0
Near-lossless MoE at estimated ~320 tok/s on the RTX 5090. Complex 10-step agent workflows complete in seconds. The Q8 quantization eliminates virtually all accuracy loss from compression.
Complex planning: Qwen 3 32B at Q6_K (~28GB). When your agent needs to plan and reason deeply before acting - research tasks, data analysis pipelines, multi-document processing - the dense model’s stronger reasoning justifies the slower speed.
For all use cases at this level, see our 32GB VRAM complete guide.
Cross-Tier Summary
| Tier | Best Pick | BFCL V4 | Speed | Practical Chain Length |
|---|---|---|---|---|
| 8GB | Qwen 3.5 9B | 66.1 | ~40 tok/s | 2-3 steps |
| 12GB | Qwen 3 14B | ~62 | ~50 tok/s | 4-5 steps |
| 16GB | Qwen 3 14B (Q5) | ~62 | ~55 tok/s | 4-5 steps |
| 24GB | Qwen 3 30B MoE | - | 196 tok/s | 8-10 steps |
| 32GB | Qwen 3 30B MoE (Q8) | - | ~320 tok/s | 10+ steps |
Setting Up a Local Agent
With Ollama + OpenClaw
OpenClaw is the open-source agent framework designed for local models. Ollama 0.17+ has native OpenClaw integration.
# Pull your agent model
ollama pull qwen3.5:9b
# Install OpenClaw
pip install openclaw
# Run an agent with tool access
openclaw run --model qwen3.5:9b --tools web,files,shell
We have a detailed guide: OpenClaw + Ollama: Run AI Agents Locally.
With Ollama + MCP (Model Context Protocol)
MCP connects your local model to external data sources and tools. Qwen-Agent has native MCP support:
pip install qwen-agent
# Configure MCP servers in your agent config
# The Qwen-Agent framework handles tool routing automatically
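A sketch of what that configuration looks like, following the pattern in Qwen-Agent's documentation - the `time` and `fetch` MCP servers are examples, not requirements, and config keys should be checked against the project README:

```python
# MCP server config in the shape Qwen-Agent expects: each entry names a
# server and the command used to launch it.
MCP_TOOLS = [{
    "mcpServers": {
        "time": {"command": "uvx", "args": ["mcp-server-time"]},
        "fetch": {"command": "uvx", "args": ["mcp-server-fetch"]},
    }
}]

if __name__ == "__main__":
    from qwen_agent.agents import Assistant  # pip install qwen-agent

    bot = Assistant(
        # Point at any OpenAI-compatible endpoint serving your model.
        llm={"model": "qwen3.5:9b", "model_server": "http://localhost:11434/v1"},
        function_list=MCP_TOOLS,  # Qwen-Agent launches and routes the servers
    )
    for chunk in bot.run(messages=[{"role": "user", "content": "What time is it?"}]):
        pass  # streaming responses; keep the last one
    print(chunk[-1]["content"])
```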
With LM Studio + Function Calling
LM Studio supports OpenAI-compatible function calling, making it work with any framework that expects the OpenAI API format. Load your model, enable the API server, and point your agent framework at http://localhost:1234/v1.
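A sketch of that setup using the official `openai` Python client - the `search_files` tool is illustrative, not an LM Studio built-in, and the model name should match whatever you loaded:

```python
# Illustrative OpenAI-format tool schema; `search_files` is a made-up
# tool name for this example.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_files",
        "description": "Search local files by keyword",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    # LM Studio ignores the API key, but the client requires one.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    resp = client.chat.completions.create(
        model="qwen3-14b",  # whichever model is loaded in LM Studio
        messages=[{"role": "user", "content": "Find my tax documents"}],
        tools=TOOLS,
    )
    # Function calls come back under message.tool_calls, OpenAI-style.
    print(resp.choices[0].message.tool_calls)
```

Because the endpoint speaks the standard OpenAI schema, the same `TOOLS` definition works unchanged if you later swap LM Studio for another OpenAI-compatible server.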
What Local Agents Can and Can’t Do
Works well:
- Smart home control (turn on lights, adjust thermostat, play music)
- Simple API integrations (check weather, send notifications, query databases)
- File operations (organize files, rename batches, extract data from documents)
- RAG pipelines (search documents, retrieve context, answer questions)
- 2-3 step workflows on 8GB, 5-8 steps on 24GB+
Struggles with:
- Complex web browsing (navigating multi-step forms, handling CAPTCHAs)
- Long planning horizons (10+ step workflows on models below 30B)
- Ambiguous tool selection (when multiple tools could work, smaller models guess wrong more often)
- Error recovery from unexpected states (API timeouts, format changes)
- Parallel tool execution reliability (calling 3+ tools simultaneously)
Not realistic yet:
- Fully autonomous agents that run for hours unsupervised
- Complex research tasks requiring judgment about source reliability
- Multi-agent coordination (requires models that can communicate reliably)
The 24GB+ tier with MoE models is where local agents become genuinely useful for real workflows. Below that, they work as automation assistants for well-defined, simple tasks.
Security Note
Running agents locally doesn’t automatically make them safe. A local model with shell access can still delete files or expose data. Always:
- Sandbox agents in containers (Docker, Firejail)
- Limit tool permissions to what’s actually needed
- Log all tool calls for review
- Never give agents write access to critical systems without confirmation gates
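As one concrete example of the sandboxing point, a hedged Docker invocation - the image name, mount path, and inner command are placeholders for your own setup:

```shell
# No network, read-only filesystem, and a single read-only data mount.
# Drop all Linux capabilities the agent doesn't need.
docker run --rm \
  --network none \
  --read-only --tmpfs /tmp \
  --cap-drop ALL \
  -v "$PWD/workspace:/workspace:ro" \
  my-agent-image:latest
```

Loosen restrictions one at a time (e.g. re-enable networking only for the specific APIs the agent calls) rather than starting permissive.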
For a complete security guide, see our AI Agent Containment guide.