Three days ago, this column reported that Gemma 4’s inference speed problem made it a benchmark trophy that couldn’t run properly on consumer hardware. Since then, llama.cpp shipped a KV cache fix that slashes VRAM usage by 40%, Zhipu AI open-sourced GLM-5.1, and Qwen 3.6 Plus arrived with a 1-million-token context window.
The open-weight field hasn’t just gotten crowded. It’s gotten absurd. Six major labs now ship competitive models under permissive licenses. The question isn’t which one is best on paper anymore—it’s which one actually runs fast on your hardware.
The Six Contenders
Here’s where the open-weight landscape stands as of this week:
| Model | Total Params | Active Params | Context | License | Speed (RTX 4090, Q4) |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B | 31B | 256K | Apache 2.0 | ~25 tok/s |
| Gemma 4 26B-A4B | 26B | 3.8B | 256K | Apache 2.0 | ~11 tok/s* |
| Qwen 3.5 27B | 27B | ~27B | 128K | Apache 2.0 | ~35 tok/s |
| Qwen 3.6 Plus | Undisclosed | Undisclosed | 1M | TBD (no weights) | API only |
| Llama 4 Scout | 109B | 17B | 10M | Community | ~20 tok/s |
| Llama 4 Maverick | 400B | 17B | 1M | Community | Datacenter only |
| Mistral Small 4 | 119B | 8B | 256K | Apache 2.0 | ~28 tok/s |
| GPT-OSS 120B | 117B | 5.1B | 131K | Apache 2.0 | ~30 tok/s |
| GPT-OSS 20B | 21B | 3.6B | 131K | Apache 2.0 | ~45 tok/s |
| GLM-5.1 | 744B | 40B | 200K | MIT | Datacenter only |
*Launch-day figure, before the KV cache fix. See below for updated numbers.
Two things jump out. First, every model with released weights except Llama ships under a fully permissive open-source license. Meta’s 700-million-MAU restriction is looking increasingly lonely. Second, Mixture-of-Experts has won: eight of the ten models listed use MoE routing to cut active parameters during inference.
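The MoE trend is easy to see in miniature: a learned router scores the experts for each token and only the top-k actually run, so per-token compute tracks active parameters, not total. A toy sketch of top-k routing (all sizes here are illustrative, not any real model’s configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 8, 2  # toy sizes, NOT a real model config

router_w = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)
experts = [  # each "expert" is a tiny 2-layer ReLU MLP
    (rng.standard_normal((d_model, 4 * d_model)) / np.sqrt(d_model),
     rng.standard_normal((4 * d_model, d_model)) / np.sqrt(4 * d_model))
    for _ in range(n_experts)
]

def moe_forward(x):
    """Route one token through its top-k experts only."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]            # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # softmax over the selected k
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0) @ w2)  # expert MLP, router-weighted
    return out

token = rng.standard_normal(d_model)
y = moe_forward(token)
# Only 2 of 8 experts ran for this token, so the layer does roughly a
# quarter of the FLOPs of a dense layer with the same total parameter count.
```

Total parameters still have to live somewhere (disk, VRAM), which is why the table distinguishes the two columns.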
Gemma 4’s Speed Gets a Patch (But Not a Cure)
Last week’s headline was Gemma 4’s FlashAttention incompatibility causing 9 tok/s on an RTX 4090 for models that should have been hitting 60+. The community responded fast.
The llama.cpp team shipped a KV cache fix that restructures how the framework handles Gemma 4’s multi-head attention grouping. The result: nearly 40% less VRAM during context-heavy operations. A Gemma 4 26B-A4B model now fits on a 24 GB RTX 3090 or 4090 with 8K context, a dramatic improvement over the memory bloat at launch.
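To see why the KV cache dominates VRAM at long context, the arithmetic is simple: every layer stores one key and one value vector per KV head per token. A back-of-the-envelope calculator (the layer and head counts below are illustrative guesses, not Gemma 4’s published configuration):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: 2 (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config (NOT Gemma 4's real numbers): 48 layers, 8 KV heads,
# head_dim 128, fp16 cache, 8K context.
gib = kv_cache_bytes(48, 8, 128, seq_len=8192) / 2**30
print(f"{gib:.2f} GiB at 8K context")  # grows linearly with context length
```

The cache grows linearly with context, so a fix that restructures how it is laid out (or how attention-head grouping shares it) pays off most exactly where users were hurting: long, context-heavy sessions.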
A separate patch adds per-head entropy-adaptive bit allocation, achieving 4x KV cache compression on hybrid-attention models like Gemma 4 with effectively lossless quality. This is clever engineering: instead of applying uniform quantization across all attention heads, it allocates bits according to each head’s actual information density.
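The idea is easy to sketch, though the sketch below is a simplified illustration and not the actual patch: estimate each head’s entropy from a histogram of its cached values, then hand high-entropy heads more bits and low-entropy heads fewer, keeping the average at the compression budget.

```python
import numpy as np

def head_entropy(kv):
    """Shannon entropy (bits) of a head's KV values, via a 64-bin histogram."""
    hist, _ = np.histogram(kv, bins=64, density=True)
    p = hist[hist > 0]
    p = p / p.sum()
    return -np.sum(p * np.log2(p))

def allocate_bits(heads, budget_bits_per_head=4, lo=2, hi=8):
    """Scale per-head bit widths by entropy, keeping the mean near the
    budget (4 bits/elem ~= 4x compression vs an fp16 cache)."""
    ents = np.array([head_entropy(h) for h in heads])
    spread = np.ptp(ents) + 1e-9
    scaled = lo + (hi - lo) * (ents - ents.min()) / spread
    bits = scaled * budget_bits_per_head / scaled.mean()  # hit the budget
    return np.clip(np.round(bits), lo, hi).astype(int)

rng = np.random.default_rng(1)
heads = [
    rng.standard_normal(4096) ** 3,   # heavy-tailed: values bunch near zero
    rng.standard_normal(4096),        # gaussian
    rng.uniform(-1, 1, 4096),         # flat distribution: high bin entropy
]
print(allocate_bits(heads))  # flatter-distributed heads get more bits
```

A head whose values cluster tightly simply needs fewer levels to represent faithfully, which is where the "lossless at 4x" claim gets its headroom.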
AMD also rolled out Gemma 4 support across its full GPU and CPU lineup, giving ROCm users an option besides Nvidia.
The throughput numbers have improved. The 31B dense model now manages 30-34 tok/s on an RTX 3090 with the KV cache fix, and the 26B-A4B hits roughly 64-119 tok/s on the same hardware depending on quantization and batch size. That’s a massive jump from the initial 11 tok/s figure—though the 64-119 range comes from optimized benchmarks rather than real conversational workloads.
The core FlashAttention problem remains unsolved: Gemma 4’s heterogeneous attention head dimensions still force a Triton fallback in vLLM, so production serving remains slower than it should be. The memory fix helps local users enormously. Enterprise serving still hurts.
Qwen 3.6 Plus: The Agentic Speed Machine
Alibaba’s Qwen 3.6 Plus Preview dropped on March 31, and it’s making a strong case for the agentic coding crown.
The headline numbers: 1-million-token context window, 65,536 max output tokens, always-on chain-of-thought reasoning, and a hybrid architecture combining linear attention with sparse MoE routing. Early community benchmarks show roughly 3x the throughput of Claude Opus 4.6 on comparable tasks—though those figures come from API comparisons, not local inference.
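The linear-attention half of that hybrid is what makes a 1-million-token window tractable: instead of materializing an O(n²) softmax attention matrix, keys and values are folded into a fixed-size state, so cost grows linearly with sequence length. A minimal non-causal sketch of the kernel trick (the feature map and sizes are illustrative; Qwen’s actual design is undisclosed):

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """O(n * d^2) attention via the kernel trick: phi(Q) @ (phi(K)^T V).

    phi is a simple positive feature map (elu + 1); production hybrids use
    fancier maps, but the cost structure is the same.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    q, k = phi(q), phi(k)
    kv = k.T @ v          # (d, d) state -- size independent of seq length
    z = k.sum(axis=0)     # (d,) normalizer accumulator
    return (q @ kv) / (q @ z + eps)[:, None]

rng = np.random.default_rng(0)
n, d = 1024, 32
q, k, v = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out = linear_attention(q, k, v)
# The (d, d) state replaces the (n, n) attention matrix, so memory stays
# flat as n grows -- the property that makes million-token contexts feasible.
```

The trade-off is expressiveness, which is presumably why the architecture is a hybrid with standard attention rather than linear attention throughout.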
On coding specifically, Qwen 3.6 Plus scores 61.6 on Terminal-Bench 2.0, edging out Claude Opus 4.5 at 59.3. Its 78.8 on SWE-bench Verified puts it just behind the leaders, and its 48.2% on MCPMark (a tool-calling reliability benchmark) leads the field.
The catch: Qwen 3.6 Plus is currently API-only via OpenRouter. No weights have been released. For this column, that means it’s relevant as a competitive benchmark but not yet a local AI option. If Alibaba follows its usual pattern, open weights for the base model will follow—but when is unclear.
For local users, Qwen 3.5 27B remains the pick from Alibaba’s lineup. It still hits 35 tok/s on an RTX 4090 at Q4 quantization and holds strong benchmark scores across the board.
OpenAI Joins the Open-Weight Club
Here’s a sentence nobody expected to write: OpenAI released open-weight models under Apache 2.0.
GPT-OSS comes in two sizes. The 120B model is a 117B-parameter MoE that activates 5.1B parameters per token and runs on a single 80 GB GPU. It matches or exceeds OpenAI’s own o4-mini on competition coding, general reasoning, and tool calling. The 20B model activates 3.6B parameters and runs on devices with just 16 GB of memory.
Both were trained with reinforcement learning informed by OpenAI’s o3 and other frontier systems. The 120B in particular performs well on health-related queries (HealthBench) and competition math (AIME 2024/2025), which suggests the RL training pipeline carried meaningful signal from the internal models.
For consumer hardware users, GPT-OSS 20B is the interesting one. At 3.6B active parameters, it’s competitive with models three times its total size while fitting on hardware most people already own. Early reports suggest roughly 45 tok/s on an RTX 4090—faster than anything else at its capability level.
GLM-5.1: The 744-Billion-Parameter Wildcard
Zhipu AI (Z.ai) open-sourced GLM-5.1 weights this weekend under an MIT license. At 744 billion total parameters with 256 experts and 40B active per token, it’s the largest open-weight model available.
Z.ai claims GLM-5.1 hits 94.6% of Claude Opus 4.6’s coding performance at a fraction of the cost. The model was trained entirely on 100,000 Huawei Ascend 910B chips—no Nvidia hardware. That’s significant for the growing number of organizations that can’t or won’t depend on Nvidia’s supply chain.
The asterisk: all benchmark numbers come from Z.ai’s own documentation. No independent labs have published corroborating results yet. And at 744B total parameters, this isn’t running on your desktop. You’re looking at serious datacenter hardware even with the MoE architecture keeping active params at 40B.
Still, the MIT license makes this a genuinely open contribution. If the benchmarks hold up under independent testing, GLM-5.1 could be the strongest option for self-hosted enterprise inference where hardware isn’t a constraint.
The Real Benchmark: What Runs on a 24 GB GPU
Benchmark tables are nice. Here’s what actually matters if you own a single consumer GPU:
Best all-around quality (24 GB GPU): Gemma 4 31B with the KV cache fix. At 30-34 tok/s on an RTX 3090, it’s now usable for conversational workloads. MMLU Pro 85.2%, AIME 89.2%, and the Arena #3 ranking reflect genuine capability. The Apache 2.0 license means no restrictions.
Best daily driver (any GPU): Qwen 3.5 27B. Still the fastest at its quality tier—35 tok/s on an RTX 4090, significantly faster on smaller contexts. Multilingual support is unmatched. If speed matters more than peak reasoning, this hasn’t been dethroned.
Best lightweight option (16 GB GPU): GPT-OSS 20B. OpenAI’s surprise entry fits on budget hardware and punches above its weight. The 3.6B active parameters keep inference fast while the RL training from o3 gives it unusually strong reasoning for its size.
Best edge device: Gemma 4 E2B. Genuine multimodal intelligence in under 2 GB of memory. Nothing else comes close at this footprint.
Best coding model (if you have the hardware): GLM-5.1, but verify independently. The 94.6% of Opus 4.6 claim needs third-party confirmation.
What Changed This Week
Three things shifted the landscape since last week’s column:
1. The speed gap narrowed. Gemma 4’s KV cache fix turned it from a benchmark curiosity into a daily-usable model on consumer hardware. It’s still not as fast as Qwen 3.5, but 30-34 tok/s is firmly in “good enough” territory for most workloads.
2. The licensing moat disappeared. OpenAI shipping Apache 2.0 models means every major AI lab except Meta now offers permissive licensing. If you’re choosing a model for a commercial product, the licensing conversation is essentially over. Pick on capability and speed, not legal risk.
3. The MoE consensus solidified. Every significant release this quarter uses Mixture-of-Experts. The implication for local users: total parameter count matters for download size and VRAM, but active parameters determine inference speed. A 120B MoE model activating 5B params can be faster than a 31B dense model. Read the active parameter count first.
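The rule of thumb in point 3 reduces to two lines of arithmetic: weight storage (download size and VRAM) scales with total parameters times bits per weight, while per-token compute scales with active parameters. A quick sketch using figures from the table above (the bits-per-weight values are the standard quantization approximations):

```python
def weight_gib(total_params_b, bits_per_weight=4):
    """Approximate weight storage: params * bits / 8 bytes, in GiB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 2**30

def relative_speed(active_params_b):
    """Per-token FLOPs scale with active params, so speed ~ 1 / active."""
    return 1.0 / active_params_b

# From the table: GPT-OSS 120B (117B total / 5.1B active) vs Gemma 4 31B dense.
for name, total, active in [("GPT-OSS 120B", 117, 5.1), ("Gemma 4 31B", 31, 31)]:
    print(f"{name}: ~{weight_gib(total):.0f} GiB at Q4, "
          f"relative speed {relative_speed(active):.2f}")
# The MoE needs roughly 4x the disk/VRAM but does roughly 6x less compute
# per token -- which is why it can be faster despite being far larger.
```

Hence the advice: check whether the total fits your VRAM, then read the active parameter count to predict speed.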
What to Watch Next Week
Qwen 3.6 Plus weights. If Alibaba releases them, the 1M context window and 3x speed advantage could reshuffle the consumer rankings overnight.
Independent GLM-5.1 benchmarks. Z.ai’s claims are bold. Third-party validation will either confirm a new coding champion or expose inflated numbers.
The vLLM FlashAttention patch for Gemma 4. The KV cache fix helped local users. A FlashAttention fix would unlock enterprise serving speeds. The vLLM issue tracker is the one to watch.
The open-weight field went from “pick the least bad option” to “which excellent option fits your hardware?” in roughly six weeks. The speed war is the only race that matters now—benchmarks stopped being the differentiator the moment everyone crossed the quality threshold. Whoever runs fastest on your specific GPU wins your daily workflow, and right now, that race is genuinely close.