For the past month, this column has tracked the rise of sparse Mixture-of-Experts models as the path forward for local AI. Activate 3 billion parameters from a 35-billion-parameter pool. Get frontier-class quality at consumer-hardware speeds. The MoE narrative was clean, compelling, and starting to look like settled science.
Then Alibaba dropped Qwen3.6-27B on April 22: a fully dense 27-billion-parameter model that beats their own 397-billion-parameter MoE flagship on every major coding benchmark. The efficiency story just got a lot more complicated.
Qwen3.6-27B: Dense and Dangerous
The numbers are striking. This is a straightforward dense Transformer: no routing, no expert selection, no MoE tricks. Every parameter fires on every token. And somehow it beats the previous-generation flagship with 14.7 times as many parameters.
| Benchmark | Qwen3.6-27B (Dense) | Qwen3.5-397B-A17B (MoE) | Qwen3.6-35B-A3B (MoE) |
|---|---|---|---|
| SWE-Bench Verified | 77.2% | 76.2% | 73.4% |
| SWE-Bench Pro | 53.5% | 50.9% | — |
| Terminal-Bench 2.0 | 59.3% | 52.5% | 51.5% |
| SkillsBench | 48.2% | 30.0% | — |
| AIME 2026 | 92.7% | — | 92.7% |
That SWE-Bench Verified score of 77.2% puts it within 3.6 points of Claude Opus 4.6 (80.8%). From a model you can download and run on a single GPU. The SkillsBench gap is even more dramatic: a roughly 61% relative improvement over the 397B model with a fraction of the parameters.
The caveat: Qwen reports these benchmarks using their own internal agent scaffold (bash + file-edit tools), and independent reproductions outside that scaffold are limited as of this writing. The numbers need verification. But Alibaba’s Qwen3.5 benchmarks held up to community scrutiny, so the track record is decent.
Running It: The Dense Advantage
Dense models have a practical advantage that MoE architectures can’t match: simplicity. No routing logic, no expert load-balancing issues, no weird quantization interactions with sparse experts. You download it, quantize it, run it.
VRAM requirements at common quantization levels:
| Quantization | Model Size | Min VRAM | Example GPU |
|---|---|---|---|
| Q4_K_M | ~16.8 GB | 20 GB | RTX 4090 / Mac M4 Pro 24GB |
| Q5_K_M | ~19.5 GB | 22 GB | RTX 4090 |
| Q6_K | ~22.5 GB | 24 GB | RTX 3090 / 4090 |
| Q8_0 | ~28.6 GB | 32 GB | RTX 5090 / Mac M4 Max 36GB |
At Q4_K_M, the 27B dense model fits comfortably on an RTX 4090 with room for a decent context window. On Apple Silicon, a 24GB Mac runs it at Q4 with some headroom.
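Deployment really is that short. As a concrete example, here’s a minimal inference sketch using the llama-cpp-python bindings; the GGUF filename is a placeholder for whatever your quantizer actually produced:

```python
# Minimal local inference with llama-cpp-python.
# The model filename below is a placeholder, not an official artifact name.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=-1,  # offload all layers to the GPU (~20 GB at Q4_K_M)
    n_ctx=32768,      # context window; raise it if you have VRAM to spare
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a Python function that merges two sorted lists."},
    ],
    max_tokens=512,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```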
The tradeoff versus the MoE sibling (35B-A3B) is speed versus quality. The dense model activates all 27 billion parameters per token. The MoE model activates roughly 3 billion. On equivalent hardware, the MoE model generates tokens significantly faster—100+ tok/s on an RTX 3090 via llama.cpp at Q4, versus roughly 30-40 tok/s for the dense 27B at comparable quantization. But the dense model’s benchmark scores are meaningfully higher, especially on agentic coding tasks.
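Those throughput figures are easy to sanity-check on your own hardware. A crude decode-speed measurement, reusing the llama-cpp-python setup from the sketch above (the prompt is arbitrary):

```python
import time

def tokens_per_second(llm, prompt: str, n: int = 256) -> float:
    """Rough decode throughput: completion tokens divided by wall time.
    Prompt processing is included, so longer generations give fairer numbers."""
    start = time.perf_counter()
    out = llm(prompt, max_tokens=n)
    return out["usage"]["completion_tokens"] / (time.perf_counter() - start)

# Load the dense 27B and the 35B-A3B GGUFs separately, then compare:
# print(tokens_per_second(dense_llm, "Explain mergesort step by step."))
# print(tokens_per_second(moe_llm, "Explain mergesort step by step."))
```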
Qwen3.6-35B-A3B: One Week Later
Speaking of the MoE sibling, last week we flagged the need for independent verification of its 73.4% SWE-Bench Verified score. Here’s where that stands.
Community testing on RTX 3090 hardware has been extensive. The performance numbers are confirmed: 101.7 tok/s at short prompts, 80.9 tok/s at long prompts using UD-Q4_K_XL quantization. Unsloth’s internal evals show Q4_K_M retaining roughly 99% of BF16 performance on code-gen benchmarks—quantization barely dents quality.
The quality claims are harder to pin down. Independent SWE-Bench reproduction is still pending, and there’s some healthy skepticism in the community about whether the 3.6 scores are meaningfully different from 3.5 outside Qwen’s own scaffolding. The safest conclusion: the model is genuinely fast and genuinely good for local coding work. Whether it’s 73.4% or a few points lower on SWE-Bench Verified doesn’t change the practical recommendation.
The Dense vs. MoE Decision
This is the real story this week. For the first time, the same vendor is shipping both a dense and a sparse model targeting the same hardware tier, and the dense model wins on quality while the sparse model wins on speed. That forces a real choice:
Pick Qwen3.6-27B (Dense) if:
- You’re doing complex agentic coding tasks where quality per response matters more than speed
- You want simpler deployment with no MoE routing complexity
- You have 24GB+ VRAM and can tolerate 30-40 tok/s
- You’re fine-tuning (dense models are dramatically easier to fine-tune than MoE; see the LoRA sketch after these lists)
Pick Qwen3.6-35B-A3B (MoE) if:
- You need fast interactive responses (100+ tok/s)
- You’re running on 16GB VRAM (only ~3B parameters are active per token, so offloading the experts to system RAM still leaves generation speed usable)
- You’re building pipelines where throughput matters more than per-request quality
- You want the best multimodal support (the A3B model includes a vision encoder)
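On that fine-tuning point: a dense checkpoint trains like any other causal LM, so off-the-shelf LoRA tooling applies directly, with no per-expert routing to worry about. Here’s a minimal sketch with transformers and peft; the Hugging Face repo id is a guess, and the target modules follow the usual Qwen-style projection names:

```python
# Hedged LoRA fine-tuning sketch for the dense 27B checkpoint.
# "Qwen/Qwen3.6-27B" is a placeholder repo id, not a confirmed one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3.6-27B"  # assumption: the real hub id may differ
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    # Standard projection names for Qwen-style dense Transformers;
    # an MoE checkpoint would need per-expert handling instead.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the 27B total
```

From here it’s a standard Trainer or TRL SFT loop; the point is that nothing about the architecture gets in the way.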
Neither choice is wrong. But the existence of both models from the same team destroys the narrative that MoE is always the better architecture for local inference. Sometimes you want all 27 billion parameters firing.
Follow-Up: What We Were Watching
vLLM Gemma 4 FlashAttention fix: Still open. Seven weeks now. PR #38891 is in progress with a per-layer attention backend approach—FlashAttention for the 83% of layers with 256-dim heads, Triton fallback only for the 7 global attention layers with 512-dim heads. Not merged yet. Gemma 4 on vLLM remains frustratingly slow. Use llama.cpp if you’re serving Gemma 4 locally.
Nemotron 3 Ultra: Still unreleased. NVIDIA’s H1 2026 window has about two months left. Documentation is there; weights are not. The 550B-parameter model remains vaporware for now.
Independent Qwen3.6-35B-A3B benchmarks: Partially confirmed. Speed numbers are solid. Quality benchmarks are being tested, but full independent SWE-Bench reproduction hasn’t landed yet. The community consensus so far: it’s very good, but pinning down exact benchmark numbers outside Qwen’s own tooling is proving difficult.
Updated Rankings: What Runs on Your Hardware
| Model | Type | Total / Active Params | Context | SWE-Bench V. | Speed (RTX 4090, Q4) |
|---|---|---|---|---|---|
| Qwen 3.6 27B | Dense | 27B / 27B | 262K+ | 77.2% | ~30-40 tok/s |
| Qwen 3.6 35B-A3B | MoE | 35B / ~3B | 262K+ | 73.4% | ~60-100 tok/s |
| Nemotron 3 Nano | MoE | 31.6B / ~3.6B | 1M | — | ~50+ tok/s |
| Gemma 4 26B-A4B | MoE | 26B / 3.8B | 256K | — | ~40-50 tok/s |
| Gemma 4 31B | Dense | 31B / 31B | 256K | — | ~30 tok/s |
| Llama 4 Scout | MoE | 109B / 17B | 10M | — | ~20 tok/s |
| GPT-OSS 120B | MoE | 117B / 5.1B | 131K | — | ~30 tok/s |
| GPT-OSS 20B | MoE | 21B / 3.6B | 131K | — | ~45 tok/s |
Qwen now occupies both of the top two spots. That’s new.
What This Means
Dense isn’t dead. The MoE hype suggested dense models were a dead end for local inference—too slow, too much memory. Qwen3.6-27B pushes back hard. At Q4_K_M, it fits in the same VRAM envelope as many MoE models and delivers notably better quality on the hardest benchmarks. If your workload values quality over throughput, dense wins.
The Qwen team is pulling away. Between the 27B dense model, the 35B-A3B MoE, and the Qwen3.6-Max cloud model (which leads six coding benchmarks), Alibaba’s Qwen team is the most prolific and arguably most competitive open-weight lab right now. Google’s Gemma 4 is solid but playing catch-up on agentic coding metrics. NVIDIA’s Nemotron 3 Nano remains the speed champion but hasn’t seen an update since launch.
Benchmark scaffolding matters. The fact that Qwen reports benchmarks using their own agent scaffold, and independent reproduction outside that scaffold is limited, is worth paying attention to. It doesn’t mean the numbers are wrong—it means the numbers describe performance in a specific environment. Your mileage in other frameworks may vary. We need more community testing with standardized scaffolding.
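To make “scaffold” concrete, this is roughly the skeleton of the kind of harness Qwen describes: a loop that hands the model a bash tool and feeds results back. It’s an illustrative sketch, not Qwen’s actual harness; the endpoint, model name, and tool schema are all assumptions, and a real scaffold adds sandboxing, file-edit tools, and turn limits:

```python
# Skeleton of a bash-tool agent loop (illustrative only, unsandboxed).
import json
import subprocess
from openai import OpenAI

# Assumption: a local OpenAI-compatible server, e.g. llama-server on port 8080.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}

messages = [{"role": "user", "content": "How many Python files are in this repo?"}]
while True:
    resp = client.chat.completions.create(model="local", messages=messages, tools=[BASH_TOOL])
    msg = resp.choices[0].message
    if not msg.tool_calls:       # plain-text answer: the episode is done
        print(msg.content)
        break
    messages.append(msg)         # keep the tool call in the transcript
    for call in msg.tool_calls:  # execute each requested command
        cmd = json.loads(call.function.arguments)["command"]
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result.stdout or result.stderr,
        })
```

Small choices in that loop, like how errors are surfaced or how many turns are allowed, can move agentic benchmark scores by whole points. That is exactly why reproduction outside the original scaffold is hard.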
My Picks This Week
Best for coding quality (24 GB GPU): Qwen3.6-27B. If the SWE-Bench numbers hold up—and the SkillsBench improvement over the 397B model is particularly compelling—this is the best model you can run locally for writing, debugging, and testing code. The dense architecture makes deployment straightforward.
Best for speed (24 GB GPU): Qwen3.6-35B-A3B. Still the fastest option with competitive quality. Use llama.cpp, not Ollama, for full throughput. If you’re building automated pipelines where you’re making dozens of model calls, the 3x speed advantage over the dense sibling adds up fast.
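For that kind of pipeline work, llama.cpp’s llama-server exposes an OpenAI-compatible endpoint, so fanning out dozens of calls is a few lines with the standard openai client. The port, model name, and file list below are placeholders for your own setup:

```python
# Pipeline sketch against a local llama-server instance
# (started with something like: llama-server -m model.gguf --port 8080).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

files_to_review = ["utils.py", "parser.py", "cache.py"]  # illustrative inputs
for path in files_to_review:
    with open(path) as f:
        source = f.read()
    resp = client.chat.completions.create(
        model="qwen3.6-35b-a3b",  # placeholder; the server uses whatever it loaded
        messages=[{"role": "user", "content": f"Review this file for bugs:\n\n{source}"}],
        max_tokens=1024,
    )
    print(f"--- {path} ---\n{resp.choices[0].message.content}")
```

At 100 tok/s, a 1,000-token review per file finishes in seconds; at the dense model’s 30-40 tok/s, the same batch takes roughly three times as long. That’s the throughput math behind the recommendation.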
Best for sustained long sessions: Nemotron 3 Nano. The Mamba architecture’s constant-memory generation hasn’t been matched. For 100K+ token context windows, this is still the pick.
Best multimodal (16 GB GPU): Gemma 4 26B-A4B. The only competitive option combining vision, audio, and text with a permissive Apache 2.0 license, and the MoE architecture fits in 16 GB. Once the vLLM fix lands, this gets even more attractive for serving.
What to Watch Next Week
Qwen3.6-27B independent benchmarks. The 77.2% SWE-Bench Verified claim is the biggest number in the column right now. Community reproduction should start this week. If it holds outside Qwen’s scaffolding, this model reshapes the local AI conversation.
ICLR 2026. The conference runs April 24-28 in Rio de Janeiro. New research on efficient architectures, quantization techniques, and inference optimization could influence the next generation of local models. Watch for papers on hybrid dense-MoE approaches—Qwen’s dual-model strategy suggests this is a productive design space.
vLLM Gemma 4 fix timeline. PR #38891 is the most-watched fix in the local AI community. Seven weeks is a long time for a performance regression of this magnitude. If it merges, Gemma 4 becomes significantly more competitive for serving workloads.
The scorecard: nine models worth tracking, two from the same team at the top, and a fundamental question about whether dense or sparse is the right architecture for your workload. The answer, for the first time, is genuinely “it depends.”