Open-Weight LLM Showdown: What Actually Runs on Your GPU (March 2026)

GLM-5, Qwen 3.5, DeepSeek V3.2, and MiniMax M2.5 are rewriting the rules. Here's what they actually deliver on consumer hardware.

The open-weight LLM space looks nothing like it did six months ago. Chinese labs have stormed past Western open models, Mixture of Experts has gone mainstream, and the gap between local inference and cloud APIs keeps shrinking.

Here’s what’s actually competitive in March 2026, what hardware you need, and the tradeoffs nobody talks about.

The New Frontier Leaders

GLM-5 (744B Parameters, 40B Active)

Zhipu AI’s GLM-5, released February 11, is the first open-weight model to hit 50 on the Artificial Analysis Intelligence Index. The previous best was 42 (GLM-4.7).

The headline numbers:

  • SWE-bench Verified: 77.8%
  • GPQA Diamond: 86.0%
  • AIME 2026: 92.7%
  • BrowseComp: 75.9

That SWE-bench score matters. It means the model can actually write working code, not just pass synthetic benchmarks. GLM-5 is designed as an agentic coding model first - it performs best when it can execute code, search documentation, and use external tools.

The catch: you’re not running this locally. At 744B total parameters, INT4 weights alone run to roughly 370GB, and even ~2-bit quantization needs around 200GB. That’s Mac Studio M3 Ultra territory ($8,000+) for slow inference, or enterprise H100 clusters for production speeds.

Qwen 3.5 (397B MoE, 17B Active)

Alibaba’s flagship Qwen 3.5 posts the best GPQA Diamond score of any open-weight model at 88.4, edges out GPT-4o and DeepSeek-V3 on most public benchmarks, and does it with just 17B active parameters.

Key scores:

  • LiveCodeBench v6: 83.6
  • AIME 2026: 91.3
  • IFEval: 92.6

Two days ago, Alibaba also dropped the Qwen 3.5 Small series - 0.8B to 9B parameter models that run on phones. The 9B model matches GPT-OSS-120B on GPQA Diamond (81.7 vs 71.5) and MMMU-Pro (70.1 vs 59.7).

A 9B model that beats a 120B model. That’s the state of efficient architecture in 2026.

MiniMax M2.5 (230B MoE, 10B Active)

MiniMax M2.5 dropped the same day as GLM-5. At 80.2% on SWE-bench Verified, it’s the open-weight coding king for practical deployment.

Why it matters for local users: 10B active parameters per forward pass. That’s dramatically more runnable than GLM-5. The model completes SWE-bench Verified tasks 37% faster than its predecessor, matching Claude Opus 4.6’s speed.

Running cost via API: $1 per hour at 100 tokens/second. Running locally: feasible on high-end consumer hardware with aggressive quantization.
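That hourly rate translates to a per-token price - a quick conversion sketch, assuming the quoted throughput is sustained for the full hour:

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Convert hourly pricing to an effective $/1M-token rate,
    assuming the quoted throughput is sustained."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

# The article's MiniMax M2.5 numbers: $1/hour at 100 tokens/second
print(round(cost_per_million_tokens(1.0, 100), 2))  # -> 2.78
```

About $2.78 per million tokens, which is the more useful figure when comparing against per-token API pricing elsewhere.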

DeepSeek V3.2 (685B MoE)

DeepSeek V3.2 matches GPT-5 on multiple reasoning benchmarks. The specialized V3.2-Speciale variant beats GPT-5 and ties Gemini 3.0 Pro on STEM benchmarks.

It won gold at IOI 2025, the ICPC World Finals 2025, IMO 2025, and CMO 2025. Competition-level performance in an open-weight model.

The agentic capabilities stand out. DeepSeek explicitly trained for long-tail agent tasks - the kind that require multi-step planning and tool use, not just single-turn responses.

What Actually Fits on Consumer Hardware

Here’s the uncomfortable truth: frontier open-weight models are too big to run locally without serious hardware investment.

But the smaller models have gotten remarkably good.

RTX 4090 (24GB VRAM)

The 4090 remains the sweet spot for serious local inference. With 24GB of GDDR6X, you can run:

  • 30B MoE models: 195 tokens/second at 4K context, 75 tokens/second at 57K context
  • 7-8B models: 100-140 tokens/second
  • Quantized 27B models: Gemma 3 27B drops from 54GB to 14GB with INT4 and fits comfortably

With 4-bit quantization, a model that needs 14GB in FP16 shrinks to 4-5GB. The quality tradeoff? Benchmarks show Q5_K_M retains roughly 95% of original quality at 2x speed improvement. Q6_K and Q5_K are nearly indistinguishable from original weights in blind testing.
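The arithmetic behind those numbers is simple: bytes ≈ parameters × bits-per-weight / 8. A rough sizing helper - the bits-per-weight figures below are approximate GGUF values (K-quants carry fractional metadata bits), and real files add embedding and per-layer overhead, so treat the output as ballpark:

```python
# Approximate effective bits-per-weight for common GGUF quant levels.
# These are assumed ballpark figures, not exact file-format constants.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate in-VRAM size of the weights alone, in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"7B @ {quant}: {weight_gb(7, quant):.1f} GB")
```

A 7B model comes out at 14GB in F16 and roughly 4.2GB at Q4_K_M - matching the 4-5GB figure above once you add KV cache and runtime overhead.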

Best bets for 24GB:

  • Qwen 3.5-9B (new, excellent quality-to-size ratio)
  • Gemma 3 27B quantized
  • Mistral 3 14B
  • Llama 4 Scout (17B active, but needs quantization)

Apple Silicon (M-series)

The M4 Pro with 48-64GB of unified memory can run models that would choke discrete GPUs - but at a significant speed penalty.

The hardware handles the memory. A 64GB M4 Pro can load models that would need multiple GPUs on the PC side. But token generation lags considerably behind. Where an RTX 4090 hits 100+ tokens/second on an 8B model, M-series typically delivers 20-40 tokens/second.

The tradeoff: you can run bigger models, but slower. For coding assistance where you wait between requests anyway, this matters less than for interactive chat.
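The speed gap is mostly memory bandwidth. Single-stream decoding has to stream the active weights through memory once per generated token, so peak bandwidth divided by model size gives a hard ceiling. A rough roofline sketch - the bandwidth figures (~1008 GB/s for the RTX 4090, ~273 GB/s for the M4 Pro) are published specs, and real throughput lands well below the ceiling:

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Roofline upper bound for single-stream decoding: each token
    streams the full active weight set through memory once. Actual
    throughput is lower due to compute, KV cache, and framework overhead."""
    return bandwidth_gb_s / model_gb

# An ~8B model quantized to ~4.5 GB of weights:
for name, bw in [("RTX 4090 (~1008 GB/s)", 1008), ("M4 Pro (~273 GB/s)", 273)]:
    print(f"{name}: ceiling {max_tokens_per_second(bw, 4.5):.0f} tok/s")
```

Ceilings of roughly 224 vs 61 tokens/second - which lines up with the observed 100+ vs 20-40 once real-world overhead is subtracted.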

Budget Builds (8-16GB VRAM)

If you’re on an RTX 4060 or 4070, focus on:

  • Qwen 3.5-4B and 9B
  • Mistral 3 8B
  • Gemma 3 9B

These models punch well above their weight class. The 9B tier has closed the gap dramatically with last year’s 70B models on most practical tasks.

Inference Stack: Ollama vs llama.cpp vs vLLM

Three options dominate local inference:

Ollama: Easiest setup. Pull a model with one command. Built on llama.cpp but adds a nice API layer. The catch: typically 10-30% overhead compared to raw llama.cpp, and sometimes worse - in one benchmark, llama.cpp generated 161 tokens/second while Ollama managed 89, about 1.8x slower.

llama.cpp: Raw performance. Full hardware control. The C++ core is incredibly fast for single-user inference. Pain point: concurrency. Multiple requests queue linearly, making time-to-first-token spike under load.

vLLM: Production-grade serving. Handles concurrent users well. Requires more setup and GPU memory overhead, but scales properly.

Recommendation: Start with Ollama for prototyping. Migrate to llama.cpp when you need maximum single-user speed, or to vLLM when you need to serve concurrent users.
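If you go the Ollama route, its local REST API also lets you benchmark models yourself: the non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), which together give measured decode speed. A minimal sketch - the model tag qwen3.5:9b is an assumed name for illustration, not a verified tag:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def decode_tokens_per_second(resp: dict) -> float:
    """Ollama's non-streaming response reports eval_count (generated
    tokens) and eval_duration (nanoseconds spent generating them)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

def generate(model: str, prompt: str) -> dict:
    """POST a non-streaming generate request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL, data=payload.encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)

# Usage (requires a running Ollama server; model tag is illustrative):
#   resp = generate("qwen3.5:9b", "Explain MoE routing in two sentences.")
#   print(f"{decode_tokens_per_second(resp):.0f} tokens/sec")
```

Measuring on your own hardware beats trusting published tokens/second figures, which vary wildly with context length and quant level.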

The Kimi K2.5 Elephant in the Room

Kimi K2.5 from Moonshot AI deserves mention for what it represents: 1 trillion parameters, 32B active, native multimodal, scoring 47 on the Intelligence Index.

It achieves 96.1% on AIME 2025. It generates code from visual specifications - UI designs, video workflows - and orchestrates tools based on visual input.

The local reality: you need ~600GB for INT4 weights. The 1.8-bit quantized version fits on a single 24GB GPU if you offload MoE layers to system RAM, but expect around 10 tokens/second with 256GB of system memory. That’s “technically possible” territory, not “actually practical.”
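That offload claim checks out on paper. A back-of-envelope check - weights only, so KV cache, activations, and runtime overhead are not counted and real headroom is smaller:

```python
def weights_fit(params_billions: float, bits_per_weight: float,
                vram_gb: float, system_ram_gb: float):
    """Do the quantized weights alone fit in VRAM plus system RAM
    (with MoE layers offloaded)? Ignores KV cache and activations."""
    needed_gb = params_billions * bits_per_weight / 8
    return needed_gb, needed_gb <= vram_gb + system_ram_gb

# Kimi K2.5: ~1T params at 1.8-bit, on a 24GB GPU + 256GB system RAM
needed, ok = weights_fit(1000, 1.8, 24, 256)
print(f"{needed:.0f} GB needed -> fits: {ok}")  # 225 GB needed -> fits: True
```

225GB of weights against 280GB of combined memory: it fits, barely, which is exactly why throughput collapses to ~10 tokens/second once most layers live in system RAM.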

The Chinese Lab Advantage

Looking at this list, the pattern is obvious: GLM-5, Qwen 3.5, DeepSeek V3.2, MiniMax M2.5, Kimi K2.5. Chinese labs dominate frontier open-weight models.

Meta’s Llama 4 series exists - Maverick beat GPT-4o and Gemini 2.0 Flash on Chatbot Arena - but it’s playing catch-up on the metrics that matter for agentic use. Mistral Large 3 is competitive but requires more resources than the Chinese alternatives for similar performance.

For local users, this matters because of licensing. GLM-5 ships under MIT. Qwen uses Apache 2.0. These are genuinely permissive terms. You can fine-tune them, deploy them commercially, and build products without navigating restricted use clauses.

What This Means

The practical gap between cloud AI and local inference continues to narrow - but not uniformly.

For everyday tasks (summarization, simple coding, chat), a quantized 9B model on consumer hardware is genuinely competitive with GPT-4. For complex agentic workflows and frontier capabilities, you still need either expensive local hardware or cloud APIs.

The middle ground has gotten crowded with excellent options. If you have a 4090 or recent Mac, the Qwen 3.5-9B release from two days ago is worth immediate attention. It represents the current efficiency frontier for models that actually fit on consumer hardware.

What You Can Do

  1. Have 24GB VRAM? Try Qwen 3.5-9B or Gemma 3 27B (Q4 quantized). Both run well on Ollama with minimal setup.

  2. Limited to 8-16GB? Qwen 3.5-4B or Mistral 3 8B. Surprisingly capable for the size.

  3. Want frontier performance locally? Wait for more aggressive quantization work on GLM-5 and MiniMax M2.5, or budget for a Mac Studio with 128GB+ unified memory.

  4. Building agentic applications? Look at DeepSeek V3.2 or MiniMax M2.5 via API while hardware catches up. Their agentic training shows in real-world agent benchmarks.

The gap isn’t closed, but it’s closing faster than the cloud providers would like.