Local AI Showdown: The Best Open-Weight Models for Your Hardware in February 2026

A tier-by-tier comparison of the top open-weight LLMs you can run locally, from 8GB laptops to 24GB gaming GPUs to Apple Silicon Macs.

The open-weight model scene has moved fast over the past few months. OpenAI shipped its first open-weight models. Alibaba dropped Qwen 3.5 with frontier-level performance. Zhipu AI’s GLM-5 topped the intelligence rankings. And Google’s quantization-aware training made 27B models fit on cards that used to struggle with 13B.

The question isn’t whether local AI is viable anymore. It’s which model to run on which hardware.

We broke down the current state of open-weight models into four tiers based on what you actually have sitting on your desk. Every recommendation accounts for quantization overhead, real-world token speeds, and practical usability - not just benchmark scores.

The Rules

All models referenced here are open-weight with permissive or commercially usable licenses. Performance figures come from community benchmarks using Ollama, llama.cpp, or vLLM on the specified hardware. Quantization level is Q4_K_M unless noted otherwise - it’s the standard balance between quality and memory savings.

One critical thing to understand about running models locally: if the model doesn’t fit entirely in your GPU’s VRAM, layers get offloaded to system RAM. Each token then requires a round-trip across the PCIe bus. The result is roughly a 10x speed penalty. The line between “fits in VRAM” and “partially offloaded” is the single most important factor in local model performance.
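A rough way to sanity-check the fits-in-VRAM line is to estimate the quantized weight size plus an allowance for the KV cache. This is a back-of-the-envelope sketch: the bits-per-weight figures and the flat 1.5GB overhead are ballpark assumptions, not exact numbers for any particular GGUF file or context length.

```python
# Rough VRAM estimator for quantized models (assumed figures, not exact).
# Effective bits per weight for common GGUF quant levels (approximate):
BITS_PER_WEIGHT = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}

def est_vram_gb(params_b: float, quant: str = "Q4_K_M",
                kv_overhead_gb: float = 1.5) -> float:
    """Quantized weights plus a flat allowance for KV cache and
    activations at modest context lengths."""
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + kv_overhead_gb

def fits(params_b: float, vram_gb: float, quant: str = "Q4_K_M") -> bool:
    """True if the model should run fully on-GPU, avoiding the ~10x
    penalty from offloading layers to system RAM."""
    return est_vram_gb(params_b, quant) <= vram_gb

print(f"{est_vram_gb(8):.2f} GB")   # an 8B model at Q4_K_M: ~6.35 GB
print(fits(8, 8), fits(14, 8))      # 8B fits in 8GB; 14B does not
```

By this estimate an 8B model at Q4_K_M lands around 6.4GB, consistent with the 6–7GB figure above, while a 14B model at the same quantization blows past an 8GB card.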

Tier 1: 8GB VRAM (RTX 4060, RTX 3060 8GB, integrated graphics)

Your sweet spot: 7B–8B parameter models

This is the entry-level tier, and it’s better than you’d expect. An 8B model with Q4_K_M quantization needs about 6–7GB of VRAM, leaving just enough headroom for the KV cache during inference.

Top picks

Qwen 3 8B - The overall winner at this size. Strong reasoning, good multilingual support, and solid math performance. Runs at 40–53 tokens/second on an RTX 4060, which is faster than reading speed. The hybrid thinking mode lets it switch between quick responses and step-by-step reasoning depending on the prompt.

Gemma 3 4B (QAT) - Google’s quantization-aware trained version deserves special attention. Because it was trained with quantization in mind rather than quantized after the fact, the 4B model punches well above its weight. It roughly matches Gemma 2 9B performance while using half the memory. If you’re extremely VRAM-constrained (think 6GB cards or integrated graphics), this is the one.

Phi-4 14B (at Q3 quantization) - Microsoft’s reasoning specialist. At Q3_K_M quantization it squeezes into 8GB VRAM, though you’re giving up some quality for the fit. Scores 84.8 on MMLU and 82.6 on HumanEval despite its size. If your use case is primarily math, logic, or code generation, the reasoning ability at this size is unmatched.

What to expect

General conversation, summarization, simple coding tasks, and writing assistance all work well at this tier. You’ll notice the models struggle with complex multi-step reasoning, long context (most 8B models cap at 8K–32K usable context), and tasks that require deep domain knowledge. But for a model running on a $300 GPU with zero data leaving your machine, it’s genuinely impressive.

Tier 2: 16GB VRAM (RTX 4060 Ti 16GB, RTX 4080, RTX A4000)

Your sweet spot: 14B–20B parameter models

The jump from 8B to 14B–20B is where local AI starts feeling like a real tool instead of a novelty. You get noticeably better coherence, stronger reasoning, and the ability to handle tasks that stump smaller models.

Top picks

GPT-OSS 20B - OpenAI’s first open-weight model, and it’s a statement. Ships natively quantized in MXFP4 format, needing just 16GB of memory. Matches or exceeds OpenAI o3-mini on competition math, health queries, and general reasoning benchmarks. Runs at around 140 tokens/second on an RTX 4080. Apache 2.0 license. The fact that an OpenAI model now runs on consumer hardware under an open license would have been unthinkable a year ago.

Qwen 3 14B - The best general-purpose model at this tier. Delivers 62 tokens/second with a 12GB memory footprint, leaving room for context and parallel workloads. Hybrid reasoning mode, strong coding ability, and excellent instruction following. It’s the default recommendation for most people with 16GB cards.

Ministral 3 14B - Mistral’s edge-optimized model. Hits 70 tokens/second while fitting entirely in 16GB VRAM. The reasoning variant scores 85% on AIME ’25, which is competition math territory. All under Apache 2.0. Particularly strong at producing concise, well-structured output - it tends to generate fewer tokens than comparable models while maintaining quality, which translates to faster effective task completion.

What to expect

At this tier, you can handle most day-to-day AI tasks without feeling limited. Code generation and review works well. Long-form writing is coherent. Simple agentic tasks (tool calling, multi-step workflows) become viable. Context windows push to 128K tokens on some models, though actual usable context is typically lower before quality degrades.

Tier 3: 24GB VRAM (RTX 4090, RTX 3090, RTX A5000)

Your sweet spot: 27B–32B parameter models (or quantized 70B)

This is the tier where local AI gets serious. The RTX 4090 remains the consumer king - 24GB of GDDR6X with 1.01 TB/s bandwidth. It’s enough to run 32B models at full Q4 quantization with room to spare, or push into 70B territory with aggressive quantization.

Top picks

Qwen 3 32B - The best all-around model on a 24GB card. Needs about 22–24GB at Q4_K_M. Performance at this size represents a clear step up from the 14B class across every benchmark. If you want one model that handles everything - coding, writing, analysis, math, conversation - this is it.

EXAONE 4.0 32B - LG AI Research’s hybrid reasoning model that scores 62 on the Artificial Analysis Intelligence Index in reasoning mode, matching models many times its size. Trained on 14 trillion tokens with 128K context support. The non-reasoning mode scores 51, making it versatile for different workload types. One caveat: the license restricts commercial use, so check the terms if you’re building a product.

Gemma 3 27B (QAT) - Google’s quantization-aware trained version drops from 54GB at BF16 to just 14.1GB at int4. That’s unusually efficient - you can run it with massive context headroom on a 24GB card. Earned a 1,338 Elo in Chatbot Arena, putting it in the top ten ahead of OpenAI o1. Strong vision capabilities too, if you need image understanding.

DeepSeek R1 Distill 32B - If reasoning is your priority, this distilled version of DeepSeek’s R1 retains most of the full model’s step-by-step problem-solving ability. Particularly good at math, code, and structured analysis. The trade-off is that it’s slower at simple tasks because it defaults to verbose reasoning chains.

Quantized 70B models - The RTX 4090 can handle Llama 3.3 70B or Qwen 2.5 72B at Q3 or Q4 quantization, hitting about 26–52 tokens/second depending on the model and quantization level. You lose some quality from the heavy quantization, but a 70B model - even quantized - outperforms a 32B model at full precision on most tasks. Worth trying if response quality matters more than speed.

What to expect

At this tier, local models start competing with cloud API responses. Complex reasoning chains work. Multi-turn conversations maintain coherence over long exchanges. Agentic coding tasks - where the model needs to plan, execute, and self-correct - become practical. The main remaining gap with frontier cloud models is in extremely long contexts and bleeding-edge tool use.

Tier 4: Apple Silicon (M-series Macs with 32GB–128GB unified memory)

Your sweet spot: depends entirely on unified memory

Apple Silicon rewrites the rules because unified memory means the GPU and CPU share the same memory pool with no copy penalty. An M4 Max with 128GB unified memory can load models that would need multiple NVIDIA GPUs.

The catch: memory bandwidth. LLM inference is almost entirely bandwidth-bound during token generation, and Apple Silicon bandwidth varies dramatically by chip. An M3 Max (400 GB/s) actually generates tokens faster than an M4 Pro (273 GB/s) despite being the older chip. So pay attention to bandwidth, not just total memory.
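Because generation is bandwidth-bound, you can estimate a hard ceiling on token speed by dividing memory bandwidth by model size - each token requires streaming roughly the whole model from memory. The sketch below uses an assumed ~19GB Q4 32B model; real throughput typically lands well under the theoretical ceiling.

```python
# Bandwidth ceiling on token generation (a rough model, not a benchmark):
# tokens/sec can't exceed memory bandwidth divided by model size,
# since every generated token reads ~all of the weights once.

def max_tokens_per_sec(bandwidth_gbps: float, model_size_gb: float) -> float:
    return bandwidth_gbps / model_size_gb

MODEL_GB = 19  # assumed size of a 32B model at Q4-class quantization

m3_max = max_tokens_per_sec(400, MODEL_GB)  # M3 Max: 400 GB/s
m4_pro = max_tokens_per_sec(273, MODEL_GB)  # M4 Pro: 273 GB/s
print(f"{m3_max:.0f} vs {m4_pro:.0f} t/s ceiling")
```

The arithmetic shows why the older M3 Max wins: its ceiling is roughly 21 tokens/second against the M4 Pro’s 14, regardless of which chip has the newer cores.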

Top picks by memory tier

32GB (M4 Pro base): Qwen 3 14B or GPT-OSS 20B at Q4 quantization. Expect 25–35 tokens/second. The same models that shine on 16GB NVIDIA cards work well here, with more headroom for context.

64GB (M4 Pro/Max): Qwen 3 32B at Q4–Q5 is the top recommendation, running at 12–18 tokens/second. Qwen3-Coder-Next 80B (3B active parameters) also fits at Q4_K_M - its 48.5GB quantized weight size squeaks in under 64GB with room for overhead. This is a coding-specialized model that beats DeepSeek V3.2 on SWE-Bench while using 12x less compute per token.

96–128GB (M4 Max/Ultra): This is where things get wild. Llama 3.3 70B runs at Q4–Q5 quantization at 8–15 tokens/second. Step up to a 192GB or 256GB Mac Studio and Qwen 3.5 397B-A17B - a frontier-class model with only 17B active parameters - fits at 3-bit or 4-bit quantization respectively, generating around 25 tokens/second with MoE offloading. You’re running a model that competes with GPT-5.2 from your desk.

MLX matters

If you’re on Apple Silicon, use the MLX backend rather than the standard GGUF/llama.cpp path. MLX is built specifically for Apple’s hardware and consistently outperforms the generic path by 20–40% on token generation speed.

The MoE Wild Card

Mixture of Experts models have changed what’s possible at every tier. The key insight: a 400B MoE model with 17B active parameters needs to store all 400B parameters in memory, but only computes with 17B per token. So inference speed is similar to a 17B dense model, but you need the memory of a 400B model.

This makes MoE models perfect for high-memory, bandwidth-rich setups (read: Apple Silicon Macs). The standouts:

  • Qwen 3.5 397B-A17B (February 2026) - Frontier performance. 83.6 on LiveCodeBench, 91.3 on AIME ’26. Needs ~214GB at Q4 via Unsloth’s dynamic quantization.
  • GLM-5 745B-A44B (February 2026) - Tops the intelligence index with 50.0 score. Lowest hallucination rate in the industry. Needs server hardware or a massive Mac.
  • Llama 4 Scout 109B-A17B - 10 million token context window. Strong multimodal capabilities. Fits on a 64GB Mac at Q4.
  • Qwen3-Coder-Next 80B-A3B - Only 3B active parameters but 71.3% on SWE-Bench Verified. Fits on a 64GB Mac at Q4. The best coding model you can run locally on consumer hardware, period.
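The MoE memory-versus-speed trade-off above can be sketched in a few lines. The ~0.54 bytes per parameter for Q4-class quantization is an assumption, but it reproduces the ~214GB figure quoted for Qwen 3.5 397B-A17B.

```python
# MoE sketch: memory scales with TOTAL parameters; per-token bandwidth
# scales with ACTIVE parameters. 0.54 bytes/param is an assumed average
# for Q4-class quantization, not an exact GGUF figure.

BYTES_PER_PARAM_Q4 = 0.54

def moe_memory_gb(total_params_b: float) -> float:
    """Memory needed to hold every expert, active or not."""
    return total_params_b * BYTES_PER_PARAM_Q4

def moe_active_read_gb(active_params_b: float) -> float:
    """Approximate weight bytes streamed per generated token."""
    return active_params_b * BYTES_PER_PARAM_Q4

# A 397B-total / 17B-active shape: ~214 GB to hold,
# but each token only touches ~9 GB of weights.
print(f"{moe_memory_gb(397):.0f} GB held, "
      f"~{moe_active_read_gb(17):.0f} GB read per token")
```

That gap - 214GB of residency against roughly 9GB of reads per token - is exactly why high-memory, high-bandwidth machines run these models at dense-17B speeds.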

Quick Reference: Best Model Per Hardware

Hardware   | Best General  | Best Coding       | Best Reasoning  | Speed
8GB VRAM   | Qwen 3 8B     | Qwen 3 8B         | Phi-4 14B (Q3)  | 40–53 t/s
16GB VRAM  | GPT-OSS 20B   | GPT-OSS 20B       | Ministral 3 14B | 60–140 t/s
24GB VRAM  | Qwen 3 32B    | Qwen 3 32B        | EXAONE 4.0 32B  | 30–52 t/s
Mac 64GB   | Qwen 3 32B    | Qwen3-Coder-Next  | Qwen 3 32B      | 12–18 t/s
Mac 128GB  | Llama 3.3 70B | Qwen3-Coder-Next  | DeepSeek R1 70B | 8–15 t/s

How to Get Started

All of these models are available through Ollama with a single command:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run any model
ollama run qwen3:8b          # 8GB tier
ollama run gpt-oss:20b       # 16GB tier
ollama run qwen3:32b         # 24GB tier

For Apple Silicon users wanting the MLX backend, LM Studio provides a clean interface with automatic MLX optimization. For power users who want more control, llama.cpp and vLLM offer the best performance tuning options.

If you want a full ChatGPT-style web interface running locally, check out our guide to setting up Ollama with Open WebUI - published earlier today.

The Bottom Line

February 2026 is the best time yet to run AI locally. OpenAI has open-weight models now. A $300 GPU runs models that beat GPT-3.5. A $1,600 Mac Mini handles models that rival GPT-4. And a maxed-out Mac Studio runs frontier-class 400B parameter models from your desk.

The performance gap between local and cloud is shrinking every month. For many tasks - writing, coding, analysis, research - it’s already closed. The remaining advantages of cloud models are in bleeding-edge reasoning, extremely long contexts, and complex multi-modal tasks. For everything else, you can keep your data on your own machine and still get excellent results.

Your hardware decides your tier. This guide tells you which model to pick once you know your tier. Beyond that, the only step left is ollama pull.