Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
AI agents that browse the web, query databases, control smart home devices, and chain tools together to solve problems - all running on your own GPU. That’s the promise. The reality depends on whether your model can reliably call functions without hallucinating parameters or losing track of multi-step plans.
Function calling is the hardest task for small models. A chat model can get away with an imprecise answer. A coding model can produce code that’s mostly right. But a model calling a function with the wrong argument crashes the workflow. There’s no partial credit.
We tested local models on tool use benchmarks to find which ones actually work as agents at each VRAM tier.
The Benchmarks That Matter
- BFCL V4 - Berkeley Function Calling Leaderboard. Tests single-turn and multi-turn function calls, serial and parallel execution, and multi-language support. The standard benchmark for tool use reliability
- TAU2-bench - Real-world tool-augmented tasks. Tests whether models can complete end-to-end workflows using tools
- ScreenSpot Pro - GUI understanding and interaction. Can the model navigate a desktop application?
- OSWorld-Verified - Desktop automation tasks. The hardest agentic benchmark - complete real tasks on a real operating system
The Models
| Model | Params | BFCL V4 | TAU2-bench | VRAM (Q4) | Tool Call Format |
|---|---|---|---|---|---|
| Qwen 3.5 9B | 9B | 66.1 | 79.1 | ~7 GB | Native Qwen tool format |
| Qwen 3.5 4B | 4B | - | - | 3.4 GB | Native Qwen tool format |
| Qwen 3 8B | 8B | ~55 | - | 6.5 GB | Native Qwen tool format |
| Qwen 3 14B | 14B | ~62 | - | 10.7 GB | Native Qwen tool format |
| Qwen 3 32B | 32B | ~68 | - | 22.2 GB | Native Qwen tool format |
| Qwen 3 30B MoE | 30B (3B active) | - | - | ~18 GB | Native Qwen tool format |
| EXAONE 4.0 32B | 32B | - | - | ~22 GB | OpenAI-compatible |
| Llama 3.1 8B | 8B | ~48 | - | 6.2 GB | Llama tool format |
| Mistral Small 22B | 22B | - | - | ~14 GB | Mistral tool format |
For context, GPT-5 mini scores 55.5 on BFCL V4. Qwen 3.5 9B’s 66.1 beats it by 19%. On TAU2-bench (79.1), it outperforms models 3-13x its size. This is the strongest small model result for agent tasks.
What Makes a Good Agent Model?
Beyond benchmarks, practical agent reliability depends on:
- Structured output - does it format JSON function calls correctly, every time?
- Stop token discipline - does it stop after the function call, or keep generating text that breaks parsing?
- Multi-step planning - can it chain 3+ tool calls without losing track of the goal?
- Error recovery - when a tool returns an error, does it retry with corrected parameters or spiral?
- Temperature sensitivity - at temperature 0, is the output deterministic and parseable?
The Qwen models excel at the first three: they use a well-defined tool-call template that frameworks parse easily. Llama models are more variable - they work, but need more careful prompt engineering for reliable structured output.
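The "no partial credit" point is why frameworks validate every call before executing it. A minimal sketch of that validation layer, assuming a generic JSON tool-call format (the exact template varies by model family, and both tool names here are hypothetical):

```python
import json

# Hypothetical tool schema: name -> set of required parameter names.
TOOLS = {
    "get_weather": {"city"},
    "send_notification": {"message"},
}

def parse_tool_call(raw: str):
    """Validate a model's raw tool-call output before executing anything.

    Returns (name, arguments) on success, or None if the output is
    malformed - the caller can then re-prompt the model instead of
    crashing the workflow.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(call, dict):
        return None  # JSON, but not a call object
    name = call.get("name")
    args = call.get("arguments")
    if name not in TOOLS or not isinstance(args, dict):
        return None  # unknown tool or missing argument object
    if not TOOLS[name] <= args.keys():
        return None  # required parameters missing (or hallucinated away)
    return name, args

# A well-formed call passes; a call with hallucinated parameters does not.
print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
print(parse_tool_call('{"name": "get_weather", "arguments": {"zip": 90210}}'))
```

Returning `None` instead of raising is deliberate: it gives the agent loop a chance to re-prompt, which is exactly the error-recovery behavior point 4 tests.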
8GB VRAM {#8gb}
GPUs: RTX 4060, RTX 3060 8GB, RTX 3070
Candidates
| Model | BFCL V4 | TAU2-bench | VRAM | Reliability |
|---|---|---|---|---|
| Qwen 3.5 9B | 66.1 | 79.1 | ~7 GB | High |
| Qwen 3 8B | ~55 | - | 6.5 GB | Good |
| Qwen 3.5 4B | - | - | 3.4 GB | Moderate |
| Llama 3.1 8B | ~48 | - | 6.2 GB | Variable |
Winner: Qwen 3.5 9B
The Qwen 3.5 9B is the clear pick. At 66.1 BFCL V4, it outperforms GPT-5 mini on function calling while fitting in 7GB. The TAU2-bench score (79.1) confirms it can handle real multi-step workflows, not just synthetic benchmarks.
Both “thinking” and “non-thinking” modes are supported. Use thinking mode when the agent needs to reason about which tool to call. Use non-thinking mode for speed when the tool choice is obvious.
ollama pull qwen3.5:9b
Budget option: Qwen 3.5 4B at 3.4GB. Function calling works, but reliability drops on complex chains. Good enough for simple automations - "check the weather and send me a notification" - but not for multi-step workflows.
The honest take: 8GB agent models can handle 2-3 step tool chains reliably. Beyond that, they start losing context and making mistakes. For simple automations (smart home control, single API calls, basic data retrieval), they work well. For complex multi-step agents (research tasks, multi-tool workflows), you need more VRAM.
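A single-API-call automation like the weather example needs nothing more than Ollama's `/api/chat` endpoint with its OpenAI-style `tools` parameter. A minimal sketch - the `get_weather` tool is hypothetical, and the request-building part is separated out so it works without a running server:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an Ollama /api/chat payload advertising one tool.

    `get_weather` is a made-up example tool; swap in your own schema.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

if __name__ == "__main__":
    payload = build_chat_request("qwen3.5:9b", "What's the weather in Oslo?")
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # If the model chose to call the tool, the call arrives here.
    print(reply["message"].get("tool_calls"))
```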
For all use cases at this level, see our 8GB VRAM complete guide.
12GB VRAM {#12gb}
GPUs: RTX 3060 12GB, RTX 4070
Candidates
| Model | BFCL V4 | VRAM | Speed |
|---|---|---|---|
| Qwen 3 14B | ~62 | 10.7 GB | ~50 tok/s |
| Qwen 3.5 9B | 66.1 | ~7 GB | ~55 tok/s |
| Phi-4 14B | - | ~10 GB | 25-35 tok/s |
Winner: Qwen 3 14B
At 12GB, you can run the 14B model with decent context headroom. The Qwen 3 14B has native tool calling support and more consistent structured output than the 8B. The extra parameters translate directly to better multi-step planning - it can maintain a 4-5 step tool chain without losing the thread.
Alternative: Qwen 3.5 9B with 5GB of context headroom. If your agent tasks involve large tool responses (database queries, web page contents), the extra context buffer matters more than the 14B’s marginal quality improvement.
For all use cases at this level, see our 12GB VRAM complete guide.
16GB VRAM {#16gb}
GPUs: RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, Intel Arc A770
Candidates
| Model | BFCL V4 | VRAM | Best For |
|---|---|---|---|
| Qwen 3 14B (Q5) | ~62 | ~12 GB | Higher quant, better reliability |
| GPT-OSS 20B | - | ~14 GB | OpenAI-format tool calls |
| Mistral Small 22B | - | ~14 GB | Multi-tool chains |
| Qwen 3.5 9B | 66.1 | ~7 GB | Speed + room for tools |
Winner: Qwen 3 14B at Q5_K_M
The quality bump from Q4 to Q5 quantization matters more for agents than for chat. At Q5_K_M (~12GB), function call formatting becomes more reliable - fewer malformed JSON outputs, better parameter accuracy. The 4GB headroom on a 16GB card gives you room for tool response context.
OpenAI-compatible: GPT-OSS 20B at ~14GB. If your agent framework expects OpenAI-format function calls (most do), GPT-OSS speaks that format natively. The speed advantage (~140 tok/s) means faster tool chain execution.
Multi-tool specialist: Mistral Small 22B at ~14GB. Mistral models have strong parallel function calling - they can decide to call multiple tools simultaneously when the calls are independent. If your workflow involves “fetch weather AND check calendar AND query database” in parallel, Mistral handles this pattern well.
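On the framework side, a parallel turn like that executes concurrently. A sketch with stub tools standing in for real integrations (all three tool names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub tools standing in for real integrations.
def fetch_weather(city):   return f"sunny in {city}"
def check_calendar(day):   return f"2 events on {day}"
def query_database(sql):   return f"3 rows for {sql!r}"

REGISTRY = {
    "fetch_weather": fetch_weather,
    "check_calendar": check_calendar,
    "query_database": query_database,
}

def run_parallel(tool_calls):
    """Execute independent tool calls concurrently, as a framework does
    when the model emits several calls in a single turn."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(REGISTRY[c["name"]], **c["arguments"])
            for c in tool_calls
        ]
        # Results come back in the same order the model emitted the calls.
        return [f.result() for f in futures]

# The kind of turn described above: three independent calls at once.
results = run_parallel([
    {"name": "fetch_weather", "arguments": {"city": "Oslo"}},
    {"name": "check_calendar", "arguments": {"day": "Monday"}},
    {"name": "query_database", "arguments": {"sql": "SELECT 1"}},
])
print(results)
```

Note the caveat from the "struggles with" list later on: parallel calls only pay off when the model correctly judges that the calls are independent.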
For all use cases at this level, see our 16GB VRAM complete guide.
24GB VRAM {#24gb}
GPUs: RTX 3090, RTX 4090
This is where local agents get serious. The 32B models can maintain complex multi-step workflows and handle long tool response chains.
Candidates
| Model | BFCL V4 | VRAM | Speed (RTX 4090) |
|---|---|---|---|
| Qwen 3 32B | ~68 | 22.2 GB | 34 tok/s |
| Qwen 3 30B MoE | - | ~18 GB | 196 tok/s |
| EXAONE 4.0 32B | - | ~22 GB | ~30 tok/s |
| Gemma 3 27B QAT | - | ~14 GB | ~40 tok/s |
Winner: Qwen 3 30B MoE
For agents, speed matters as much as quality. Each tool call is a round trip - the model generates the call, the tool executes, the response comes back, the model processes it. A 5-step workflow means 5+ inference passes. At 196 tok/s on the RTX 4090, the Qwen 3 30B MoE completes these chains in seconds rather than minutes.
The MoE architecture activates only 3B parameters per token, so it’s fast, but the full 30B parameter space gives it broad tool-calling knowledge. At ~18GB, you have 6GB free for context - enough for complex tool responses.
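The round trip described above is easiest to see as a loop. A minimal sketch with a scripted stand-in for the model (real frameworks add call validation, retries, and token budgets):

```python
def agent_loop(call_model, tools, user_goal, max_steps=10):
    """Minimal agent loop: each tool call costs one model round trip.

    `call_model` takes the history and returns either
    {"tool": name, "args": {...}} or {"answer": text}.
    """
    history = [("user", user_goal)]
    for _ in range(max_steps):
        action = call_model(history)
        if "answer" in action:
            return action["answer"]  # chain complete
        result = tools[action["tool"]](**action["args"])
        history.append(("tool", result))  # feed the result back
    return None  # chain too long for this model / step budget

# Scripted stand-in for the model: two tool calls, then a final answer.
script = iter([
    {"tool": "get_time", "args": {}},
    {"tool": "set_alarm", "args": {"at": "07:00"}},
    {"answer": "Alarm set for 07:00."},
])
tools = {"get_time": lambda: "22:15", "set_alarm": lambda at: f"alarm@{at}"}
print(agent_loop(lambda h: next(script), tools, "Wake me at 7"))
```

Each iteration is a full inference pass over the growing history, which is why a fast MoE model shortens a 5-step workflow so dramatically.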
Quality ceiling: Qwen 3 32B dense at ~68 BFCL V4. When reliability matters more than speed (financial transactions, critical automations), the dense model’s higher accuracy is worth the 6x speed penalty.
Dark horse: EXAONE 4.0 32B. LG’s model has strong reasoning capabilities (92.3 MMLU-Redux) which translates to better planning in multi-step agent tasks. Less community testing for agents specifically, but worth trying if Qwen doesn’t fit your workflow.
For all use cases at this level, see our 24GB VRAM complete guide.
32GB VRAM {#32gb}
GPUs: RTX 5090
Candidates
| Model | BFCL V4 | VRAM | Notes |
|---|---|---|---|
| Qwen 3 32B (Q6) | ~68 | ~28 GB | Near-lossless agent quality |
| Qwen 3 30B MoE (Q8) | - | ~28 GB | Maximum speed + quality |
| Llama 3.3 70B (Q3) | - | ~32 GB | Tight fit; little context headroom, and heavy Q3 quantization can hurt tool-call reliability |
Winner: Qwen 3 30B MoE at Q8_0
Near-lossless MoE at estimated ~320 tok/s on the RTX 5090. Complex 10-step agent workflows complete in seconds. The Q8 quantization eliminates virtually all accuracy loss from compression.
Complex planning: Qwen 3 32B at Q6_K (~28GB). When your agent needs to plan and reason deeply before acting - research tasks, data analysis pipelines, multi-document processing - the dense model’s stronger reasoning justifies the slower speed.
For all use cases at this level, see our 32GB VRAM complete guide.
Cross-Tier Summary
| Tier | Best Pick | BFCL V4 | Speed | Practical Chain Length |
|---|---|---|---|---|
| 8GB | Qwen 3.5 9B | 66.1 | ~40 tok/s | 2-3 steps |
| 12GB | Qwen 3 14B | ~62 | ~50 tok/s | 4-5 steps |
| 16GB | Qwen 3 14B (Q5) | ~62 | ~55 tok/s | 4-5 steps |
| 24GB | Qwen 3 30B MoE | - | 196 tok/s | 8-10 steps |
| 32GB | Qwen 3 30B MoE (Q8) | - | ~320 tok/s | 10+ steps |
Setting Up a Local Agent
With Ollama + OpenClaw
OpenClaw is the open-source agent framework designed for local models. Ollama 0.17+ has native OpenClaw integration.
# Pull your agent model
ollama pull qwen3.5:9b
# Install OpenClaw
pip install openclaw
# Run an agent with tool access
openclaw run --model qwen3.5:9b --tools web,files,shell
We have a detailed guide: OpenClaw + Ollama: Run AI Agents Locally.
With Ollama + MCP (Model Context Protocol)
MCP connects your local model to external data sources and tools. Qwen-Agent has native MCP support:
pip install qwen-agent
# Configure MCP servers in your agent config
# The Qwen-Agent framework handles tool routing automatically
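A sketch of what that configuration looks like, following the pattern in Qwen-Agent's documentation - the `time` and `fetch` MCP servers are examples, not requirements, and config keys should be checked against the project README:

```python
# MCP server config in the shape Qwen-Agent expects: each entry names a
# server and the command used to launch it.
MCP_TOOLS = [{
    "mcpServers": {
        "time": {"command": "uvx", "args": ["mcp-server-time"]},
        "fetch": {"command": "uvx", "args": ["mcp-server-fetch"]},
    }
}]

if __name__ == "__main__":
    from qwen_agent.agents import Assistant  # pip install qwen-agent

    bot = Assistant(
        # Point at any OpenAI-compatible endpoint serving your model.
        llm={"model": "qwen3.5:9b", "model_server": "http://localhost:11434/v1"},
        function_list=MCP_TOOLS,  # Qwen-Agent launches and routes the servers
    )
    for chunk in bot.run(messages=[{"role": "user", "content": "What time is it?"}]):
        pass  # streaming responses; keep the last one
    print(chunk[-1]["content"])
```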
With LM Studio + Function Calling
LM Studio supports OpenAI-compatible function calling, making it work with any framework that expects the OpenAI API format. Load your model, enable the API server, and point your agent framework at http://localhost:1234/v1.
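A sketch of that setup using the official `openai` Python client - the `search_files` tool is illustrative, not an LM Studio built-in, and the model name should match whatever you loaded:

```python
# Illustrative OpenAI-format tool schema; `search_files` is a made-up
# tool name for this example.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_files",
        "description": "Search local files by keyword",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    # LM Studio ignores the API key, but the client requires one.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    resp = client.chat.completions.create(
        model="qwen3-14b",  # whichever model is loaded in LM Studio
        messages=[{"role": "user", "content": "Find my tax documents"}],
        tools=TOOLS,
    )
    # Function calls come back under message.tool_calls, OpenAI-style.
    print(resp.choices[0].message.tool_calls)
```

Because the endpoint speaks the standard OpenAI schema, the same `TOOLS` definition works unchanged if you later swap LM Studio for another OpenAI-compatible server.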
What Local Agents Can and Can’t Do
Works well:
- Smart home control (turn on lights, adjust thermostat, play music)
- Simple API integrations (check weather, send notifications, query databases)
- File operations (organize files, rename batches, extract data from documents)
- RAG pipelines (search documents, retrieve context, answer questions)
- 2-3 step workflows on 8GB, 5-8 steps on 24GB+
Struggles with:
- Complex web browsing (navigating multi-step forms, handling CAPTCHAs)
- Long planning horizons (10+ step workflows on models below 30B)
- Ambiguous tool selection (when multiple tools could work, smaller models guess wrong more often)
- Error recovery from unexpected states (API timeouts, format changes)
- Parallel tool execution reliability (calling 3+ tools simultaneously)
Not realistic yet:
- Fully autonomous agents that run for hours unsupervised
- Complex research tasks requiring judgment about source reliability
- Multi-agent coordination (requires models that can communicate reliably)
The 24GB+ tier with MoE models is where local agents become genuinely useful for real workflows. Below that, they work as automation assistants for well-defined, simple tasks.
Security Note
Running agents locally doesn’t automatically make them safe. A local model with shell access can still delete files or expose data. Always:
- Sandbox agents in containers (Docker, Firejail)
- Limit tool permissions to what’s actually needed
- Log all tool calls for review
- Never give agents write access to critical systems without confirmation gates
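As one concrete example of the sandboxing point, a hedged Docker invocation - the image name, mount path, and inner command are placeholders for your own setup:

```shell
# No network, read-only filesystem, and a single read-only data mount.
# Drop all Linux capabilities the agent doesn't need.
docker run --rm \
  --network none \
  --read-only --tmpfs /tmp \
  --cap-drop ALL \
  -v "$PWD/workspace:/workspace:ro" \
  my-agent-image:latest
```

Loosen restrictions one at a time (e.g. re-enable networking only for the specific APIs the agent calls) rather than starting permissive.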
For a complete security guide, see our AI Agent Containment guide.