Last week, NVIDIA’s Nemotron 3 Nano landed with a fundamentally different architecture and throughput numbers that made everything else in its weight class look sluggish. This week, Alibaba responded. Qwen3.6-35B-A3B dropped on April 14 under Apache 2.0, and its benchmark numbers are hard to ignore—73.4% on SWE-Bench Verified, 85.2% on MMLU-Pro, and an MCPMark tool-use score that doubles Gemma 4’s.
The ~3B active parameter class—the models that actually fit on a single consumer GPU and run fast—now has three serious competitors with very different strengths. The era of one obvious pick is over.
Qwen3.6-35B-A3B: The Numbers
Alibaba’s latest is a sparse Mixture-of-Experts model: 35 billion total parameters, roughly 3 billion active per token. It’s a pure Transformer MoE—no architectural experiments, just a well-tuned version of what the Qwen team has been refining for three generations.
The published benchmarks are aggressive:
| Benchmark | Qwen3.6-35B-A3B | Nemotron 3 Nano | Gemma 4 26B-A4B | GPT-OSS 20B |
|---|---|---|---|---|
| MMLU-Pro | 85.2% | 78.3% | ~77%* | ~74%* |
| SWE-Bench Verified | 73.4% | — | — | — |
| LiveCodeBench v6 | ~66% | 68.3% | ~69% | 61.0% |
| GPQA (science reasoning) | 86.0% | — | — | — |
| MCPMark (tool use) | 37.0% | — | 18.1% | — |
*Estimated from published ranges. Dashes indicate unpublished results for that specific benchmark.
Two things jump out. First, the SWE-Bench Verified score of 73.4% puts this model in territory normally reserved for frontier proprietary models—Claude Sonnet-class performance on real-world software engineering tasks, from a model that fits on a MacBook. Second, the MCPMark tool-use score is more than double Gemma 4’s, suggesting Qwen3.6 was specifically trained for the agentic coding workflows that are rapidly becoming the primary use case for local models.
These are vendor-reported numbers. Independent verification is pending. But Alibaba’s track record with Qwen3.5 benchmarks was solid—the community largely confirmed those scores—so there’s reason for cautious optimism.
Running It Locally
The practical deployment story is good. The Q4_K_M GGUF from Unsloth weighs about 20.9 GB. That fits comfortably on an RTX 4090 (24 GB) and runs on a 64 GB MacBook Pro M4/M5.
Speed on consumer hardware:
| Hardware | Quantization | Tokens/sec | Backend |
|---|---|---|---|
| RTX 4090 | Q4_K_M | ~60-100 | llama.cpp |
| RTX 4090 | Q4_K_M | ~15-20 | Ollama |
| MacBook Pro M5 64GB | Q4_K_S | ~35-50 | llama.cpp |
Ollama shipped native support on April 16; run it with `ollama run qwen3.6`. But note the significant overhead: llama.cpp direct gives 3-5x the throughput of Ollama on the same hardware. If speed matters, skip the convenience layer.
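If you want the llama.cpp numbers from the table above, the direct invocation is short. A minimal sketch, where the Hugging Face repo and file names are my guesses at Unsloth's usual naming scheme, so verify against the actual listing:

```bash
# Grab the Q4_K_M GGUF (repo and file names are assumptions -- check first)
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  Qwen3.6-35B-A3B-Q4_K_M.gguf --local-dir ./models

# Offload all layers to the GPU (-ngl 99) and set a working context (-c)
./llama-cli -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 -c 16384 \
  -p "Write a Python function that parses RFC 3339 timestamps."
```

The `-ngl 99` flag is what buys you the top-line throughput; leave layers on the CPU and the numbers sag fast.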
Context window is 262K tokens natively, extendable to roughly 1M with YaRN. That’s well short of Nemotron 3 Nano’s native 1M but sufficient for most local workflows.
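Pushing past the native window is a launch-flag change in llama.cpp. A sketch assuming factor-4 YaRN over the 262,144-token base (confirm the recommended factor against the model card):

```bash
# YaRN rope scaling: stretch the native 262K window roughly 4x toward 1M.
# The KV cache at this length will blow well past 24 GB of VRAM, so treat
# this as a big-memory or unified-memory (Mac) configuration.
./llama-server -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 \
  -c 1048576 -ngl 99
```

One caveat: static YaRN scaling can cost some short-context quality, so leave it off unless you actually need the long window.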
The 3B Active Parameter War
Here’s what makes this week interesting. We now have three competitive models all activating roughly 3-4 billion parameters from larger pools, each optimized for different things:
Qwen3.6-35B-A3B — Best raw benchmark scores, especially on agentic coding (SWE-Bench, MCPMark). Pure Transformer MoE. The quality pick.
Nemotron 3 Nano — Hybrid Mamba-Transformer MoE. Lower benchmarks than Qwen3.6 but 3.3x higher throughput on datacenter hardware and constant-memory generation that keeps speed consistent during long conversations. The speed pick.
Gemma 4 26B-A4B — Apache 2.0, strong on LiveCodeBench, native vision and audio. The most permissive license and the best multimodal option, but slightly behind Qwen3.6 on pure coding benchmarks.
All three fit on a single RTX 4090 with Q4 quantization. All three are Apache 2.0 or equivalently permissive. The differences come down to what you’re optimizing for.
If you’re building agentic coding pipelines that need to pass tests and use tools, Qwen3.6’s SWE-Bench and MCPMark lead matters. If you’re running long interactive sessions or working with massive codebases where sustained throughput beats peak quality, Nemotron 3 Nano’s architectural advantage holds. If you need vision, audio, or the broadest ecosystem compatibility, Gemma 4 is still the safest choice.
Follow-Up: What We Were Watching
Last week’s column flagged three things. Here’s where they stand:
Nemotron 3 Super independent benchmarks: Confirmed by Artificial Analysis—449 output tokens per second, ranked #1 of 51 models in its intelligence tier. The throughput claims held up. But the bad news for consumer users: minimum hardware is 2x H100-80GB GPUs. This is a datacenter model. Nemotron 3 Nano remains the only consumer-viable member of the family.
Nemotron 3 Ultra release: Still unreleased. The 550B-parameter model with 55B active per token is expected sometime in H1 2026. Documentation exists in NVIDIA’s nightly builds, but weights aren’t available yet. When it lands, it’ll be one of the largest open-weight models you can download, but you’ll need serious hardware to run it.
The vLLM FlashAttention fix for Gemma 4: Still broken. The dual attention head dimensions (256 for sliding window, 512 for global) continue to force Triton fallback in vLLM. Gemma 4 26B-A4B still crawls at ~9 tok/s on an RTX 4090 through vLLM versus 60+ tok/s through llama.cpp. If you’re serving Gemma 4, use llama.cpp for single users, or tolerate the latency hit on vLLM for batched serving. This is now six weeks without a fix.
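The single-user workaround looks like this (model filename hypothetical); since llama-server exposes an OpenAI-compatible endpoint, existing clients just need a new base URL:

```bash
# Serve Gemma 4 through llama.cpp's OpenAI-compatible server instead of vLLM
./llama-server -m ./models/gemma-4-26b-a4b-Q4_K_M.gguf \
  -ngl 99 -c 32768 --port 8080

# Point any OpenAI-style client at http://localhost:8080/v1
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain the tradeoffs of sliding-window attention in one paragraph."}]}'
```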
Bonus: GLM-5.1 Gets Partial Validation
One more update worth noting. Zhipu AI’s GLM-5.1—the 744B-parameter open-weight model we’ve been tracking since its March release—received independent confirmation on Code Arena at 1530 Elo, placing it third on the agentic webdev leaderboard. Separate testing pegged it at 58.4 on SWE-Bench Pro, slightly below Z.ai’s internal claim but still strong. GLM-5.1 isn’t a consumer model—at 744B parameters it needs serious infrastructure—but the independent validation is encouraging for the credibility of open-weight benchmark claims generally.
Updated Rankings: What Runs on Your Hardware
| Model | Total Params | Active Params | Context | License | Speed (RTX 4090, Q4) |
|---|---|---|---|---|---|
| Qwen 3.6 35B-A3B | 35B | ~3B | 262K+ | Apache 2.0 | ~60-100 tok/s |
| Nemotron 3 Nano | 31.6B | ~3.6B | 1M | NVIDIA Open | ~50+ tok/s |
| Gemma 4 26B-A4B | 26B | 3.8B | 256K | Apache 2.0 | ~64 tok/s |
| Gemma 4 31B | 31B | 31B | 256K | Apache 2.0 | ~30 tok/s |
| Qwen 3.5 27B | 27B | ~27B | 128K | Apache 2.0 | ~35 tok/s |
| Llama 4 Scout | 109B | 17B | 10M | Community | ~20 tok/s |
| GPT-OSS 120B | 117B | 5.1B | 131K | Apache 2.0 | ~30 tok/s |
| GPT-OSS 20B | 21B | 3.6B | 131K | Apache 2.0 | ~45 tok/s |
Qwen3.6-35B-A3B slots in at the top of the speed column among the ~3B active models when running through llama.cpp directly. The Ollama numbers are significantly lower due to overhead, so backend choice matters as much as model choice for practical performance.
What This Means
The 3B active class is where the real competition is. Forget the 500B datacenter models. For the majority of people running AI locally, the ~3B active parameter MoE models are the sweet spot: they fit on consumer hardware, run at conversational speed, and deliver quality that was frontier-tier 18 months ago. Qwen3.6 entering this space with SWE-Bench Verified scores in the 70s means the bar just went up.
Agentic coding is the new benchmark arms race. SWE-Bench, MCPMark, Terminal-Bench—the benchmarks getting attention aren’t multiple-choice tests anymore. They’re measuring whether a model can actually write code, use tools, and pass test suites. Qwen3.6 was clearly optimized for this use case, and the scores reflect it. Expect Nemotron 3 and Gemma 4 updates to target these same metrics.
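You don’t have to take a leaderboard’s word for tool use, either; it’s easy to probe locally. Against a llama-server instance started with `--jinja` (recent builds; tool-call support depends on the model’s chat template), a minimal check looks like this, where `read_file` is a made-up function for the test:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What does /etc/hosts contain?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a text file from disk",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]
        }
      }
    }]
  }'
# A tool-capable model should answer with a tool_calls entry invoking
# read_file on {"path": "/etc/hosts"} rather than guessing in prose.
```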
Backend choice matters more than model choice. The Qwen3.6 performance gap between Ollama (~20 tok/s) and llama.cpp (~100 tok/s) on the same hardware is a 5x difference. The vLLM Gemma 4 bug creates a similar gap. Picking the right inference backend is now as important as picking the right model.
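Which is why it’s worth measuring on your own machine instead of trusting anyone’s table, including mine. llama.cpp ships llama-bench for exactly this:

```bash
# Reports prompt processing (pp) and token generation (tg) in tok/s.
# Run the same GGUF through every backend/config you're comparing.
./llama-bench -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 -p 512 -n 128
```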
My Picks This Week
Best all-around for 24 GB GPU: Qwen3.6-35B-A3B takes this from Gemma 4 31B, with a caveat. If the SWE-Bench and MCPMark numbers hold under independent testing, it’s the best coding model you can run locally by a meaningful margin. Use llama.cpp, not Ollama, for full speed; `ollama run qwen3.6` works for convenience, but know you’re leaving performance on the table.
Best for sustained long sessions: Nemotron 3 Nano. The Mamba architecture’s constant-memory generation means it doesn’t slow down as conversations grow. If you’re working with 100K+ token contexts regularly, this still matters more than a benchmark gap.
Best lightweight option (16 GB GPU): Nemotron 3 Nano at Q4. The smaller quantizations still fit 16 GB VRAM and the throughput advantage over alternatives at this memory ceiling is real.
Best multimodal local model: Gemma 4 26B-A4B. Still the only competitive option if you need vision and audio input alongside text, with an Apache 2.0 license that makes commercial use straightforward.
What to Watch Next Week
Independent Qwen3.6-35B-A3B benchmarks. The vendor numbers are strong, and Alibaba has a decent track record on benchmark accuracy. But the SWE-Bench Verified score of 73.4% is an extraordinary claim for a 3B-active model. Community reproduction should start appearing this week.
Meta’s hybrid strategy. Reports indicate Meta is developing open-source versions of upcoming models even as Muse Spark stays proprietary. Whether that means a Llama successor or open derivatives of the Avocado/Mango projects, either outcome could reshape the competitive picture.
Nemotron 3 Ultra timing. NVIDIA’s H1 2026 window is running short. The 550B-parameter model would be among the largest downloadable open-weight models and could reset expectations for what open models can do, assuming you have the hardware to run it.
The scorecard: eight models worth tracking, three of them competing head-to-head in the most important weight class for local users, and benchmark claims that need verification. The open-weight field keeps getting better, and the competition keeps getting tighter.