The open-weight leaderboard reshuffled this week. Qwen 3.5’s MoE variants are pushing frontier-class performance onto consumer hardware, NVIDIA’s Nemotron 3 Super is delivering 5x throughput improvements over its predecessor, and GLM-4.7-Flash is proving that 30B models can smoke larger competitors on coding benchmarks.
Here’s what actually matters for running models locally.
Qwen 3.5: MoE Done Right
Alibaba’s Qwen team released the full Qwen 3.5 family over the past month, and the benchmarks are hard to ignore. The flagship 397B-A17B model uses sparse Mixture-of-Experts—397 billion total parameters, but only 17 billion active per forward pass.
The numbers that matter:
- 91.3% on AIME 2026 (mathematical reasoning)
- 88.4% on GPQA Diamond (the highest among open models)
- 76.5% on IFBench (beating GPT-5.2’s 75.4%)
- Apache 2.0 license across the entire family
But the flagship isn’t what’s exciting for local users. The 35B-A3B variant activates only 3 billion parameters per token while maintaining competitive benchmarks. That’s frontier-adjacent performance on an RTX 4060.
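The speed win from sparse activation comes down to memory bandwidth: at batch size 1, decoding is bound by how many weight bytes must be streamed per generated token, so a 3B-active MoE can run several times faster than a dense model of the same total size. A minimal sketch of that relationship, plugging in the RTX 4060's ~272 GB/s bandwidth (the formula is a first-order approximation and the numbers are illustrative ceilings, not measured throughput):

```python
# Rough decode-speed model for memory-bandwidth-bound inference.
# Assumption: each generated token streams every ACTIVE parameter once;
# ignores KV cache reads, activations, and kernel overheads.

def est_tokens_per_sec(active_params_b: float, bits_per_weight: int,
                       mem_bandwidth_gbs: float) -> float:
    """Upper-bound tokens/sec = bandwidth / bytes of active weights per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

# RTX 4060: ~272 GB/s memory bandwidth.
dense = est_tokens_per_sec(35, 4, 272)  # dense 35B reads all weights per token
moe   = est_tokens_per_sec(3, 4, 272)   # A3B reads only the 3B active weights

print(f"dense 35B: ~{dense:.0f} t/s ceiling, MoE A3B: ~{moe:.0f} t/s ceiling")
```

The roughly 12x gap between the two ceilings is why MoE variants feel interactive on mid-range cards where dense models of the same total size crawl.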
Consumer Hardware Performance
| Model | RTX 4090 | Mac M4 Max | VRAM Required |
|---|---|---|---|
| Qwen3.5-27B | ~21 t/s | ~21 t/s | 16GB (8-bit) |
| Qwen3.5-35B-A3B | 60-100 t/s | 65+ t/s | 12GB (4-bit) |
| Qwen3.5-122B-A10B | 25-35 t/s | N/A | 48GB+ |
The 35B-A3B is the sweet spot for most users—it fits in 12GB VRAM at 4-bit quantization and runs fast enough for interactive use. The dense 27B is slower but scores higher on nuanced reasoning tasks.
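How does a model with 35 billion total parameters fit a 12GB card? The raw 4-bit weights alone come to ~17.5GB, so the table's figure presumably depends on the runtime keeping only the shared layers and hot experts in VRAM while cold experts sit in system RAM. A back-of-envelope sketch (the 65% resident fraction is our assumption for illustration, not a documented number):

```python
# Naive quantized weight footprint, plus an assumed VRAM-resident fraction
# for MoE models whose cold experts are offloaded to system RAM.

def weight_gb(params_b: float, bits: float) -> float:
    """Size of the quantized weights alone, in GB (no KV cache, no activations)."""
    return params_b * bits / 8  # params in billions -> GB directly

full = weight_gb(35, 4)             # all 35B weights at 4-bit
resident = full * 0.65              # assumed: shared layers + hot experts in VRAM
print(f"full weights: {full:.1f} GB, VRAM-resident (assumed 65%): {resident:.1f} GB")
```

Under that assumption the resident set lands near the table's 12GB figure; a runtime that keeps everything on-GPU would need the full 17.5GB plus cache.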
Nemotron 3 Super: Agentic Throughput Monster
NVIDIA announced Nemotron 3 Super at GTC, and it’s designed specifically for agentic workloads. The architecture is a hybrid Mamba-Transformer MoE—120B total parameters with 12B active.
What makes it stand out:
- 60.47% on SWE-Bench Verified (real-world coding)
- 1M token context window
- 2.2x higher throughput than GPT-OSS-120B
- Multi-token prediction for 3x faster inference
The throughput numbers are particularly impressive. Median output across providers hits 414.6 tokens per second—well above the 81.8 t/s average for similar-sized models. That’s 5x faster than the previous Nemotron Super.
For local deployment, Nemotron 3 Super needs roughly 64GB of memory, whether Apple unified memory or pooled VRAM. That's Mac Studio territory or dual high-end GPUs. Not consumer-friendly yet, but the architecture innovations will filter down.
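The 64GB figure is consistent with quick arithmetic: 120B parameters at 4-bit is 60GB of weights, and the hybrid design keeps the KV cache small because only the Transformer layers, not the Mamba layers, store per-token state. The layer and head counts below are illustrative assumptions, not published architecture details:

```python
# Back-of-envelope check on the 64GB figure: 4-bit weights plus KV cache.
# Attention-layer count, KV heads, and head_dim are assumptions; in a hybrid
# Mamba-Transformer, only the attention layers contribute to the KV cache.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

weights_gb = 120 * 4 / 8  # 120B params at 4-bit -> 60 GB
kv = kv_cache_gb(layers=10, kv_heads=8, head_dim=128, context=32_000)
print(f"weights: {weights_gb:.0f} GB + 32K-token KV cache: {kv:.1f} GB")
```

Even with generous context, weights plus cache stay near the quoted 64GB; a pure-Transformer 120B model with the same settings would need several times the KV budget.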
GLM-4.7-Flash: Coding Benchmark Crusher
Zhipu AI’s GLM-4.7-Flash continues to impress a month after release. The 30B MoE model is optimized specifically for coding tasks, and the benchmarks reflect that focus.
The standout result: 59.2% on SWE-Bench Verified. For comparison, Qwen3-30B scores 22% and GPT-OSS-20B scores 34%. That’s a massive gap for real-world software engineering tasks.
Local Performance
Community testing shows GLM-4.7-Flash hitting:
- 82 t/s on M4 Max MacBook Pro
- 60-80 t/s on RTX 3090/4090
- ~18GB VRAM required (4-bit quantization)
The speed advantage comes from aggressive MoE sparsity. If you’re primarily running coding tasks, GLM-4.7-Flash is currently the best performer in the 24GB VRAM tier.
Gemma 3 QAT: Google’s Quantization Play
Google’s Quantization-Aware Training approach for Gemma 3 deserves attention. The 27B model drops from 54GB (BF16) to just 14.1GB with INT4 quantization while preserving near-original quality.
A single RTX 3090 can now run Gemma 3 27B with room to spare for KV cache. That’s a significant accessibility improvement.
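The reported sizes line up with simple arithmetic: BF16 is 2 bytes per parameter and INT4 is half a byte. The 0.6GB gap over the naive INT4 figure would come from tensors kept at higher precision, such as embeddings, though that breakdown is our assumption:

```python
# Reproducing the Gemma 3 27B size figures from bits-per-parameter alone.
# The difference vs. the reported 14.1 GB is presumably higher-precision
# tensors (e.g. embeddings) -- an assumption, not a published breakdown.

def model_gb(params_b: float, bits: float) -> float:
    """Weight size in GB for params (billions) at a given bit width."""
    return params_b * bits / 8

bf16 = model_gb(27, 16)  # matches the reported 54 GB BF16 size
int4 = model_gb(27, 4)   # naive INT4 size, close to the reported 14.1 GB
print(f"BF16: {bf16:.1f} GB, naive INT4: {int4:.1f} GB")
```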
The catch: Gemma 3 trails the Chinese models on most benchmarks. It’s a solid general-purpose option, but if you care about leaderboard position, Qwen 3.5 and GLM-4.7 are ahead.
The New Local AI Tier List
Based on this week’s benchmarks and community testing:
24GB VRAM (RTX 3090/4090)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-Bench Verified |
| General | Qwen3.5-35B-A3B | 60-100 t/s | Best reasoning per VRAM |
| Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | Dense model, no MoE shortcuts |
| Speed | Mistral Small 4 (Q4) | 40-60 t/s | 256K context, multimodal |
12-16GB VRAM (RTX 4070/4080)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency shines |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | Still dominates SWE-Bench |
| General | Gemma 3 27B QAT | 20-30 t/s | Runs in 14.1GB |
Mac Unified Memory (32GB+)
The M4 Max is competitive with discrete GPUs for quantized models. Qwen3.5-35B-A3B and GLM-4.7-Flash both hit 60+ t/s on Metal, making Apple Silicon a legitimate local AI platform.
What This Means
The gap between open-weight and closed models continues to narrow. Current leaderboards show S-tier open models like GLM-4.7, Kimi K2.5, and MiniMax M2.5 matching or exceeding proprietary performance on specific benchmarks.
For local users, the practical takeaways:
- Qwen 3.5’s MoE models are the new default for general-purpose local inference
- GLM-4.7-Flash owns the coding niche if you have 24GB VRAM
- Nemotron 3 Super’s architecture is the future—watch for distilled versions
- Gemma 3 QAT makes Google’s model viable on consumer hardware
The hardware requirement wall keeps dropping. A year ago, running anything competitive meant datacenter GPUs. Now a single RTX 4090 handles quantized 70B-class inference at interactive speeds. Next year, expect the same for 100B+ models.
What You Can Do
If you have 24GB VRAM: Run GLM-4.7-Flash for coding, Qwen3.5-35B-A3B for everything else. Both hit 60+ t/s and compete with closed models on benchmarks.
If you have 12-16GB VRAM: Qwen3.5-35B-A3B at 4-bit is your best option. The MoE architecture means minimal quality loss from quantization.
If you’re on Mac: M4 Max or better gives you competitive performance. MLX optimizations for Qwen 3.5 are mature.
The open-weight ecosystem just had its strongest month since Llama 2 dropped. Check the home GPU leaderboard for current rankings—they’re updating weekly as new benchmarks arrive.