Last week, Qwen owned the top two spots in our rankings. This week, the field got crowded. DeepSeek dropped its first major model in 484 days, Gemma 4 proved its benchmark claims with independent testing, and ICLR 2026 wrapped up in Rio de Janeiro with papers that could reshape how we run models locally.
DeepSeek V4: The 484-Day Comeback
DeepSeek had been quiet since V3. Not anymore. On April 24, they shipped two models under MIT license: V4-Pro (1.6 trillion parameters, 49 billion active) and V4-Flash (284 billion parameters, 13 billion active). Both support a one-million-token context window with up to 384,000 tokens of output.
V4-Pro currently leads the BenchLM open-weight leaderboard at 87 overall. That puts it ahead of every other open model by a meaningful margin.
The local inference picture is mixed. V4-Pro at 1.6T parameters isn’t realistic for consumer hardware—you’d need a multi-GPU server even with aggressive quantization. But V4-Flash at 284B with 13B active parameters is more interesting. With 4-bit quantization, it might fit on a high-end workstation setup, though practical benchmarks on consumer hardware haven’t materialized yet.
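A quick back-of-envelope makes the tradeoff concrete. This is a sketch, not a benchmark: the 284B/13B figures come from the release, while the effective bits-per-parameter and overhead factor are my assumptions.

```python
def quantized_size_gb(params_billions: float, bits_per_param: float = 4.5,
                      overhead: float = 1.1) -> float:
    """Rough in-memory size of a quantized model.

    bits_per_param: Q4-class formats typically land around 4.5 effective
                    bits once scales and zero points are counted (assumed).
    overhead: fudge factor for embeddings, norms, runtime buffers (assumed).
    """
    return params_billions * 1e9 * bits_per_param / 8 * overhead / 1e9

# All 284B parameters have to live somewhere, even though only 13B are active.
print(f"V4-Flash weights @ ~4.5 bpp: {quantized_size_gb(284):.0f} GB")  # ~176 GB
# The active slice per token is what you'd want resident on the GPU.
print(f"Active parameters @ ~4.5 bpp: {quantized_size_gb(13):.0f} GB")  # ~8 GB
```

Roughly 176 GB of weights puts V4-Flash in system-RAM-plus-offload territory, not on a single 24GB card; the ~8 GB active slice per token is why MoE offloading schemes make it plausible anyway.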
For this column, V4-Flash is the one to watch. If the community can get it running efficiently via llama.cpp or similar frameworks, a 13B-active MoE under MIT license would be a significant new option for local deployment.
Gemma 4: From Hype to Hardware
Google released Gemma 4 almost a month ago. Now we have enough independent data to judge it properly.
The 31B dense model ranks #3 on the Arena AI text leaderboard among open models. The 26B MoE holds #6. Both outperform models with 20 times more parameters. Those aren’t Google’s numbers—that’s community voting.
The benchmark scores are holding up under independent scrutiny:
| Benchmark | Gemma 4 31B | Llama 4 Scout | DeepSeek V4 |
|---|---|---|---|
| MMLU Pro | 85.2% | — | — |
| AIME 2026 | 89.2% | 88.3% | 42.5% |
| LiveCodeBench v6 | 80.0% | 77.1% | — |
| GPQA Diamond | 84.3% | 82.3% | 58.6% |
The practical story: the 26B MoE runs on 16GB of VRAM, and the 31B dense needs 24GB or more. Both ship under Apache 2.0, keeping Gemma among the most permissively licensed options in the top tier of open models.
Gemma 4’s weakness remains agentic coding. On SWE-Bench Verified and Terminal-Bench, Qwen3.6-27B still leads by a comfortable margin. But for math, science reasoning, and general-purpose tasks, Gemma 4 is now the best Apache-licensed option you can run locally.
Qwen3.6-27B: Still Waiting on Independent Verification
Last week’s biggest question was whether Qwen3.6-27B’s 77.2% SWE-Bench Verified score would hold up outside Qwen’s own scaffolding. One week later: still waiting. Independent reproduction on SWE-Bench using third-party agent frameworks hasn’t landed yet.
The community speed numbers are solid. On an RTX 4090, expect 30-40 tok/s at Q4_K_M for the dense 27B, versus 60-100 tok/s for the MoE sibling (35B-A3B). These align with last week’s reports.
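If you want to sanity-check tok/s on your own card, a minimal timing harness with llama-cpp-python looks something like this (the GGUF filename is a placeholder; point it at whichever quant you actually pulled):

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

# Placeholder filename -- substitute your local Qwen3.6 GGUF quant.
llm = Llama(
    model_path="qwen3.6-27b-instruct-q4_k_m.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,
    verbose=False,
)

prompt = "Write a Python function that merges two sorted lists."
start = time.perf_counter()
out = llm(prompt, max_tokens=512, temperature=0.0)
elapsed = time.perf_counter() - start

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.1f} tok/s")
```

Generation speed drops as the context fills, so compare runs at similar prompt lengths.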
What we can say: people are using both Qwen3.6 models for real coding work and reporting positive results. Whether the formal benchmark numbers reproduce exactly outside Qwen’s tooling is a separate question from whether the models are good. They are.
ICLR 2026: What Matters for Local AI
The conference ran April 24-28 in Rio. Three developments are directly relevant to running models on your own hardware.
Google’s TurboQuant tackles the KV cache, the memory overhead that limits how much context you can feed a local model. The paper reports large reductions in KV-cache memory without meaningful quality loss. If this gets integrated into inference frameworks, longer context windows become practical on consumer GPUs.
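For a sense of scale, the standard KV-cache formula is easy to run yourself. The config below is an illustrative 30B-class layout I'm assuming for the example, not any specific model, and this is the arithmetic TurboQuant attacks rather than the algorithm itself:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: float) -> float:
    """Keys + values, across all layers, at a given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Assumed 30B-class config, for illustration only.
cfg = dict(n_layers=60, n_kv_heads=8, head_dim=128, seq_len=128_000)

print(f"fp16 KV cache @ 128K ctx:  {kv_cache_gb(**cfg, bytes_per_elem=2):.1f} GB")   # ~31.5 GB
print(f"4-bit KV cache @ 128K ctx: {kv_cache_gb(**cfg, bytes_per_elem=0.5):.1f} GB") # ~7.9 GB
```

At fp16 the cache alone can rival the quantized weights of a mid-size model, which is why KV compression is the unlock for long context on consumer GPUs.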
Apple’s ParaRNN received an Oral presentation, one of the conference’s highest honors. It reports a 665x speedup for training non-linear RNNs for language modeling. RNNs are naturally suited to efficient inference, requiring far less memory than attention-based architectures. Apple also demonstrated local LLM inference on M5 Max hardware using MLX, running a quantized frontier coding model entirely within Xcode.
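The memory contrast fits in a toy sketch. This isn't ParaRNN's method, just the inference-time property the paragraph above is pointing at, with made-up sizes:

```python
import numpy as np

d = 1024            # hidden size (assumed)
seq_len = 100_000   # tokens decoded so far

# Attention decoding: the per-layer KV cache grows with every token.
kv_floats = 2 * seq_len * d  # keys + values for a single layer
print(f"KV cache, one layer:  {kv_floats * 4 / 1e6:.0f} MB")  # ~819 MB at fp32

# Recurrent decoding: one fixed-size state, no matter the context length.
state = np.zeros(d, dtype=np.float32)
print(f"RNN state, one layer: {state.nbytes / 1e3:.0f} KB")   # 4 KB
```

Constant-size state is the whole pitch: context length stops being a memory problem and becomes purely a compute one.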
Quantization advances: Multiple papers pushed the state of the art. SliderQuant improves post-training quantization accuracy. ECF8 (Exponent-Concentrated FP8) yields up to 26.9% memory savings with throughput gains up to 177.1%, scaling losslessly to 671B parameters. UniQL unifies quantization and low-rank compression specifically for edge deployment.
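For orientation, here's the round-to-nearest baseline that post-training quantization papers improve on. This is generic absmax int4, a sketch of the technique family rather than SliderQuant, ECF8, or UniQL:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric absmax quantization, one scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # stand-in weight matrix
q, scale = quantize_int4(w)
print(f"mean abs error: {np.abs(w - dequantize(q, scale)).mean():.4f}")
```

The storage math is the same for every 4-bit scheme; what the papers compete on is recovering the accuracy this naive rounding throws away.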
The through-line: the research community is increasingly focused on making big models run on small hardware. That’s good news for this column.
Follow-Up: What We Were Watching
vLLM Gemma 4 FlashAttention fix: Still not merged. Eight weeks now. PR #38891’s per-layer attention backend approach—FlashAttention for the 83% of layers with 256-dim heads, Triton fallback for the 7 global attention layers—is technically sound but hasn’t crossed the finish line. A new issue (#40677) reported Gemma 4 failing on Blackwell SM120 hardware with FlashInfer, adding to the backlog. If you’re serving Gemma 4 locally, llama.cpp remains the pragmatic choice.
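The per-layer routing idea itself is simple to picture. Here's a hypothetical dispatcher, not vLLM's actual code; the head-dim and global-layer split follows the PR description, while the concrete layer layout is invented for illustration:

```python
# Hypothetical sketch of per-layer attention backend selection.
# Not vLLM internals -- just the routing logic PR #38891 describes.

def pick_backend(head_dim: int, is_global: bool) -> str:
    if is_global:
        return "triton"            # fallback for global attention layers
    if head_dim == 256:
        return "flash_attention"   # fast path for 256-dim local layers
    return "triton"

# Invented layout for demonstration: 42 layers, 7 of them global.
layers = [{"head_dim": 256, "is_global": i % 6 == 5} for i in range(42)]
backends = [pick_backend(**layer) for layer in layers]
print(backends.count("flash_attention"), "layers on FlashAttention")  # 35
```

The engineering pain isn't the dispatch, it's keeping two backends' KV layouts and kernel assumptions coherent in one forward pass, which is presumably why the PR has sat this long.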
Nemotron 3 Ultra: Still unreleased. NVIDIA’s H1 2026 deadline gives them two months. The 500B-parameter model with ~50B active per token would have been the largest open-weight release around until DeepSeek’s 1.6T V4-Pro landed this week; now it would arrive already outsized. Either way, until weights are on Hugging Face, it stays in the vaporware column.
Qwen3.6-27B independent benchmarks: Still pending. The community is using the model and finding it good. Formal SWE-Bench reproduction outside Qwen’s scaffolding hasn’t happened yet. Two weeks without independent verification is starting to be notable.
Updated Rankings: What Runs on Your Hardware
| Model | Type | Total / Active Params | Context | Best Benchmark | Speed (RTX 4090, Q4) |
|---|---|---|---|---|---|
| Qwen 3.6 27B | Dense | 27B / 27B | 262K+ | SWE-Bench V. 77.2%* | ~30-40 tok/s |
| Gemma 4 31B | Dense | 31B / 31B | 256K | AIME 89.2%, Arena #3 | ~30 tok/s |
| Qwen 3.6 35B-A3B | MoE | 35B / ~3B | 262K+ | SWE-Bench V. 73.4%* | ~60-100 tok/s |
| Gemma 4 26B-A4B | MoE | 26B / 3.8B | 256K | Arena #6 | ~40-50 tok/s |
| Nemotron 3 Nano | MoE | 31.6B / ~3.6B | 1M | — | ~50+ tok/s |
| DeepSeek V4-Flash | MoE | 284B / 13B | 1M | — | TBD |
| Llama 4 Scout | MoE | 109B / 17B | 10M | — | ~20 tok/s |
| GPT-OSS 120B | MoE | 117B / 5.1B | 131K | — | ~30 tok/s |
| GPT-OSS 20B | MoE | 21B / 3.6B | 131K | — | ~45 tok/s |
*Qwen’s internal scaffold; independent reproduction pending.
New this week: DeepSeek V4-Flash enters the rankings as a wildcard. At 13B active parameters under MIT license, it’s theoretically deployable on workstation hardware—but we need community benchmarks on consumer GPUs before it earns a proper speed rating.
Gemma 4 31B moves up. Independent Arena AI rankings and confirmed benchmark scores give it the credibility that was still forming last week. For non-coding workloads on 24GB hardware, it’s now the top pick.
What This Means
Three viable ecosystems, three licenses. Qwen (Apache 2.0), Gemma (Apache 2.0), and DeepSeek (MIT) are all shipping competitive open-weight models with permissive licenses. Llama 4 is good but its custom license limits commercial use cases. For anyone building products on local models, the licensing picture has never been better.
DeepSeek’s scale is a wildcard. V4-Pro at 1.6T parameters is a research artifact for most people; you won’t run it at home. But V4-Flash at 284B/13B-active could be the first model that brings near-frontier quality to a high-end workstation. The MIT license sweetens the deal. We need the community quantization and benchmarking cycle to play out before drawing conclusions.
The research pipeline favors local inference. TurboQuant for KV cache compression, ParaRNN for efficient architectures, SliderQuant and ECF8 for better quantization—ICLR 2026’s papers are overwhelmingly pointed at making inference cheaper and smaller. These improvements won’t show up in consumer frameworks overnight, but the direction of travel is clear: running powerful models locally is going to keep getting easier.
My Picks This Week
Best for coding quality (24 GB GPU): Qwen3.6-27B. The SWE-Bench numbers remain unverified independently, but practical user reports are consistently positive. Dense architecture means simple deployment. Until someone shows it doesn’t deliver, it holds the top coding spot.
Best for general intelligence (24 GB GPU): Gemma 4 31B. Arena AI #3 among open models, confirmed AIME and GPQA scores, Apache 2.0 license. If your workload is more reasoning and analysis than coding, this is the pick.
Best for speed (16-24 GB GPU): Qwen3.6-35B-A3B. Still the fastest competitive model on consumer hardware. 60-100 tok/s on an RTX 4090 at Q4 is hard to argue with.
Best for sustained long sessions: Nemotron 3 Nano. Constant-memory generation via Mamba architecture remains unmatched for 100K+ token contexts.
Wildcard to watch: DeepSeek V4-Flash. If it runs efficiently on consumer GPUs, it reshuffles the entire ranking. MIT license, 1M context, 13B active parameters. The specs are right. We just need the benchmarks.
What to Watch Next Week
DeepSeek V4-Flash community quantizations. GGUF and AWQ quantizations should start appearing. How much VRAM does it actually need? How fast is it? These numbers will determine whether it’s a real contender for local deployment or just a cloud-only model with downloadable weights.
Qwen3.6-27B independent SWE-Bench results. Two weeks without third-party reproduction is getting long. The model is clearly good—but if the benchmark numbers were significantly inflated by the scaffolding, that matters for comparative rankings.
ICLR fallout. Papers are one thing; implementations are another. Watch for TurboQuant or SliderQuant integration into llama.cpp, vLLM, or MLX. That’s when the conference results start helping real users.
The scorecard: ten models tracked, three permissive-license ecosystems competing hard, and DeepSeek’s comeback adding a new variable to an already competitive field. The best part? Every week, running these models at home gets a little easier.