Last week, NVIDIA’s Nemotron 3 Nano landed with a fundamentally different architecture and throughput numbers that made everything else in its weight class look sluggish. This week, Alibaba responded. Qwen3.6-35B-A3B dropped on April 14 under Apache 2.0, and its benchmark numbers are hard to ignore—73.4% on SWE-Bench Verified, 85.2% on MMLU-Pro, and an MCPMark tool-use score that doubles Gemma 4’s.
The ~3B active parameter class—the models that actually fit on a single consumer GPU and run fast—now has three serious competitors with very different strengths. The era of one obvious pick is over.
Qwen3.6-35B-A3B: The Numbers
Alibaba’s latest is a sparse Mixture-of-Experts model: 35 billion total parameters, roughly 3 billion active per token. It’s a pure Transformer MoE—no architectural experiments, just a well-tuned version of what the Qwen team has been refining for three generations.
The published benchmarks are aggressive:
| Benchmark | Qwen3.6-35B-A3B | Nemotron 3 Nano | Gemma 4 26B-A4B | GPT-OSS 20B |
|---|---|---|---|---|
| MMLU-Pro | 85.2% | 78.3% | ~77%* | ~74%* |
| SWE-Bench Verified | 73.4% | — | — | — |
| LiveCodeBench v6 | ~66% | 68.3% | ~69% | 61.0% |
| GPQA (science reasoning) | 86.0% | — | — | — |
| MCPMark (tool use) | 37.0% | — | 18.1% | — |
*Estimated from published ranges. Dashes indicate unpublished results for that specific benchmark.
Two things jump out. First, the SWE-Bench Verified score of 73.4% puts this model in territory normally reserved for frontier proprietary models—Claude Sonnet-class performance on real-world software engineering tasks, from a model that fits on a MacBook. Second, the MCPMark tool-use score is more than double Gemma 4’s, suggesting Qwen3.6 was specifically trained for the agentic coding workflows that are rapidly becoming the primary use case for local models.
These are vendor-reported numbers. Independent verification is pending. But Alibaba’s track record with Qwen3.5 benchmarks was solid—the community largely confirmed those scores—so there’s reason for cautious optimism.
Running It Locally
The practical deployment story is good. The Q4_K_M GGUF from Unsloth weighs about 20.9 GB. That fits comfortably on an RTX 4090 (24 GB) and runs on a 64 GB MacBook Pro M4/M5.
Speed on consumer hardware:
| Hardware | Quantization | Tokens/sec | Backend |
|---|---|---|---|
| RTX 4090 | Q4_K_M | ~60-100 | llama.cpp |
| RTX 4090 | Q4_K_M | ~15-20 | Ollama |
| MacBook Pro M5 64GB | Q4_K_S | ~35-50 | llama.cpp |
Ollama shipped native support on April 16; run it with `ollama run qwen3.6`. But note the significant overhead: llama.cpp direct gives 3-5x the throughput of Ollama on the same hardware. If speed matters, skip the convenience layer.
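If you want the llama.cpp numbers from the table above, the direct invocation is short. A minimal sketch, where the Hugging Face repo and file names are my guesses at Unsloth's usual naming scheme, so verify against the actual listing:

```bash
# Grab the Q4_K_M GGUF (repo and file names are assumptions -- check first)
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  Qwen3.6-35B-A3B-Q4_K_M.gguf --local-dir ./models

# Offload all layers to the GPU (-ngl 99) and set a working context (-c)
./llama-cli -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 -c 16384 \
  -p "Write a Python function that parses RFC 3339 timestamps."
```

The `-ngl 99` flag is what buys you the top-line throughput; leave layers on the CPU and the numbers sag fast.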
Context window is 262K tokens natively, extendable to roughly 1M with YaRN. That’s well short of Nemotron 3 Nano’s native 1M but sufficient for most local workflows.
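Pushing past the native window is a launch-flag change in llama.cpp. A sketch assuming factor-4 YaRN over the 262,144-token base (confirm the recommended factor against the model card):

```bash
# YaRN rope scaling: stretch the native 262K window roughly 4x toward 1M.
# The KV cache at this length will blow well past 24 GB of VRAM, so treat
# this as a big-memory or unified-memory (Mac) configuration.
./llama-server -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 \
  -c 1048576 -ngl 99
```

One caveat: static YaRN scaling can cost some short-context quality, so leave it off unless you actually need the long window.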
The 3B Active Parameter War
Here’s what makes this week interesting. We now have three competitive models all activating roughly 3-4 billion parameters from larger pools, each optimized for different things:
Qwen3.6-35B-A3B — Best raw benchmark scores, especially on agentic coding (SWE-Bench, MCPMark). Pure Transformer MoE. The quality pick.
Nemotron 3 Nano — Hybrid Mamba-Transformer MoE. Lower benchmarks than Qwen3.6 but 3.3x higher throughput on datacenter hardware and constant-memory generation that keeps speed consistent during long conversations. The speed pick.
Gemma 4 26B-A4B — Apache 2.0, strong on LiveCodeBench, native vision and audio. The most permissive license and the best multimodal option, but slightly behind Qwen3.6 on pure coding benchmarks.
All three fit on a single RTX 4090 with Q4 quantization. All three are Apache 2.0 or equivalently permissive. The differences come down to what you’re optimizing for.
If you’re building agentic coding pipelines that need to pass tests and use tools, Qwen3.6’s SWE-Bench and MCPMark lead matters. If you’re running long interactive sessions or working with massive codebases where sustained throughput beats peak quality, Nemotron 3 Nano’s architectural advantage holds. If you need vision, audio, or the broadest ecosystem compatibility, Gemma 4 is still the safest choice.
Follow-Up: What We Were Watching
Last week’s column flagged three things. Here’s where they stand:
Nemotron 3 Super independent benchmarks: Confirmed by Artificial Analysis—449 output tokens per second, ranked #1 of 51 models in its intelligence tier. The throughput claims held up. But the bad news for consumer users: minimum hardware is 2x H100-80GB GPUs. This is a datacenter model. Nemotron 3 Nano remains the only consumer-viable member of the family.
Nemotron 3 Ultra release: Still unreleased. The 550B-parameter model with 55B active per token is expected sometime in H1 2026. Documentation exists in NVIDIA’s nightly builds, but weights aren’t available yet. When it lands, it’ll be one of the largest open-weight models you can download, but you’ll need serious hardware to run it.
The vLLM FlashAttention fix for Gemma 4: Still broken. The dual attention head dimensions (256 for sliding window, 512 for global) continue to force Triton fallback in vLLM. Gemma 4 26B-A4B still crawls at ~9 tok/s on an RTX 4090 through vLLM versus 60+ tok/s through llama.cpp. If you’re serving Gemma 4, use llama.cpp for single users, or tolerate the latency hit on vLLM for batched serving. This is now six weeks without a fix.
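The single-user workaround looks like this (model filename hypothetical); since llama-server exposes an OpenAI-compatible endpoint, existing clients just need a new base URL:

```bash
# Serve Gemma 4 through llama.cpp's OpenAI-compatible server instead of vLLM
./llama-server -m ./models/gemma-4-26b-a4b-Q4_K_M.gguf \
  -ngl 99 -c 32768 --port 8080

# Point any OpenAI-style client at http://localhost:8080/v1
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain the tradeoffs of sliding-window attention in one paragraph."}]}'
```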
Bonus: GLM-5.1 Gets Partial Validation
One more update worth noting. Zhipu AI’s GLM-5.1—the 744B-parameter open-weight model we’ve been tracking since its March release—received independent confirmation on Code Arena at 1530 Elo, placing it third on the agentic webdev leaderboard. Separate testing pegged it at 58.4 on SWE-Bench Pro, slightly below Z.ai’s internal claim but still strong. GLM-5.1 isn’t a consumer model—at 744B parameters it needs serious infrastructure—but the independent validation is encouraging for the credibility of open-weight benchmark claims generally.
Updated Rankings: What Runs on Your Hardware
| Model | Total Params | Active Params | Context | License | Speed (RTX 4090, Q4) |
|---|---|---|---|---|---|
| Qwen 3.6 35B-A3B | 35B | ~3B | 262K+ | Apache 2.0 | ~60-100 tok/s |
| Nemotron 3 Nano | 31.6B | ~3.6B | 1M | NVIDIA Open | ~50+ tok/s |
| Gemma 4 26B-A4B | 26B | 3.8B | 256K | Apache 2.0 | ~64 tok/s |
| Gemma 4 31B | 31B | 31B | 256K | Apache 2.0 | ~30 tok/s |
| Qwen 3.5 27B | 27B | ~27B | 128K | Apache 2.0 | ~35 tok/s |
| Llama 4 Scout | 109B | 17B | 10M | Community | ~20 tok/s |
| GPT-OSS 120B | 117B | 5.1B | 131K | Apache 2.0 | ~30 tok/s |
| GPT-OSS 20B | 21B | 3.6B | 131K | Apache 2.0 | ~45 tok/s |
Qwen3.6-35B-A3B slots in at the top of the speed column among the ~3B active models when running through llama.cpp directly. The Ollama numbers are significantly lower due to overhead, so backend choice matters as much as model choice for practical performance.
What This Means
The 3B active class is where the real competition is. Forget the 500B datacenter models. For the majority of people running AI locally, the ~3B active parameter MoE models are the sweet spot: they fit on consumer hardware, run at conversational speed, and deliver quality that was frontier-tier 18 months ago. Qwen3.6 entering this space with SWE-Bench Verified scores in the 70s means the bar just went up.
Agentic coding is the new benchmark arms race. SWE-Bench, MCPMark, Terminal-Bench—the benchmarks getting attention aren’t multiple-choice tests anymore. They’re measuring whether a model can actually write code, use tools, and pass test suites. Qwen3.6 was clearly optimized for this use case, and the scores reflect it. Expect Nemotron 3 and Gemma 4 updates to target these same metrics.
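You don’t have to take a leaderboard’s word for tool use, either; it’s easy to probe locally. Against a llama-server instance started with `--jinja` (recent builds; tool-call support depends on the model’s chat template), a minimal check looks like this, where `read_file` is a made-up function for the test:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What does /etc/hosts contain?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a text file from disk",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]
        }
      }
    }]
  }'
# A tool-capable model should answer with a tool_calls entry invoking
# read_file on {"path": "/etc/hosts"} rather than guessing in prose.
```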
Backend choice matters more than model choice. The Qwen3.6 performance gap between Ollama (~20 tok/s) and llama.cpp (~100 tok/s) on the same hardware is a 5x difference. The vLLM Gemma 4 bug creates a similar gap. Picking the right inference backend is now as important as picking the right model.
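Which is why it’s worth measuring on your own machine instead of trusting anyone’s table, including mine. llama.cpp ships llama-bench for exactly this:

```bash
# Reports prompt processing (pp) and token generation (tg) in tok/s.
# Run the same GGUF through every backend/config you're comparing.
./llama-bench -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 -p 512 -n 128
```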
My Picks This Week
Best all-around for 24 GB GPU: Qwen3.6-35B-A3B takes this from Gemma 4 31B, with a caveat. If the SWE-Bench and MCPMark numbers hold under independent testing, it’s the best coding model you can run locally by a meaningful margin. Use llama.cpp, not Ollama, for full speed; `ollama run qwen3.6` works for convenience, but know you’re leaving performance on the table.
Best for sustained long sessions: Nemotron 3 Nano. The Mamba architecture’s constant-memory generation means it doesn’t slow down as conversations grow. If you’re working with 100K+ token contexts regularly, this still matters more than a benchmark gap.
Best lightweight option (16 GB GPU): Nemotron 3 Nano at Q4. The smaller quantizations still fit 16 GB VRAM and the throughput advantage over alternatives at this memory ceiling is real.
Best multimodal local model: Gemma 4 26B-A4B. Still the only competitive option if you need vision and audio input alongside text, with an Apache 2.0 license that makes commercial use straightforward.
What to Watch Next Week
Independent Qwen3.6-35B-A3B benchmarks. The vendor numbers are strong, and Alibaba has a decent track record on benchmark accuracy. But the SWE-Bench Verified score of 73.4% is an extraordinary claim for a 3B-active model. Community reproduction should start appearing this week.
Meta’s hybrid strategy. Reports indicate Meta is developing open-source versions of upcoming models even as Muse Spark stays proprietary. Whether that means a Llama successor or open derivatives of the Avocado/Mango projects, either outcome could reshape the competitive picture.
Nemotron 3 Ultra timing. NVIDIA’s H1 2026 window is running short. The 550B-parameter model would be among the largest downloadable open-weight models and could reset expectations for what open models can do, assuming you have the hardware to run it.
The scorecard: eight models worth tracking, three of them competing head-to-head in the most important weight class for local users, and benchmark claims that need verification. The open-weight field keeps getting better, and the competition keeps getting tighter.