The open-weight leaderboard reshuffled this week. Qwen 3.5’s MoE variants are pushing frontier-class performance onto consumer hardware, NVIDIA’s Nemotron 3 Super is delivering 5x throughput improvements over its predecessor, and GLM-4.7-Flash is proving that 30B models can smoke larger competitors on coding benchmarks.
Here’s what actually matters for running models locally.
Qwen 3.5: MoE Done Right
Alibaba’s Qwen team released the full Qwen 3.5 family over the past month, and the benchmarks are hard to ignore. The flagship 397B-A17B model uses sparse Mixture-of-Experts—397 billion total parameters, but only 17 billion active per forward pass.
The numbers that matter:
- 91.3% on AIME 2026 (mathematical reasoning)
- 88.4% on GPQA Diamond (the highest among open models)
- 76.5% on IFBench (beating GPT-5.2’s 75.4%)
- Apache 2.0 license across the entire family
But the flagship isn’t what’s exciting for local users. The 35B-A3B variant activates only 3 billion parameters per token while maintaining competitive benchmarks. That’s frontier-adjacent performance on an RTX 4060.
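The speed win from sparse activation comes down to memory bandwidth: at batch size 1, decoding is bound by how many weight bytes must be streamed per generated token, so a 3B-active MoE can run several times faster than a dense model of the same total size. A minimal sketch of that relationship, plugging in the RTX 4060's ~272 GB/s bandwidth (the formula is a first-order approximation and the numbers are illustrative ceilings, not measured throughput):

```python
# Rough decode-speed model for memory-bandwidth-bound inference.
# Assumption: each generated token streams every ACTIVE parameter once;
# ignores KV cache reads, activations, and kernel overheads.

def est_tokens_per_sec(active_params_b: float, bits_per_weight: int,
                       mem_bandwidth_gbs: float) -> float:
    """Upper-bound tokens/sec = bandwidth / bytes of active weights per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

# RTX 4060: ~272 GB/s memory bandwidth.
dense = est_tokens_per_sec(35, 4, 272)  # dense 35B reads all weights per token
moe   = est_tokens_per_sec(3, 4, 272)   # A3B reads only the 3B active weights

print(f"dense 35B: ~{dense:.0f} t/s ceiling, MoE A3B: ~{moe:.0f} t/s ceiling")
```

The roughly 12x gap between the two ceilings is why MoE variants feel interactive on mid-range cards where dense models of the same total size crawl.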
Consumer Hardware Performance
| Model | RTX 4090 | Mac M4 Max | VRAM Required |
|---|---|---|---|
| Qwen3.5-27B | ~21 t/s | ~21 t/s | 16GB (8-bit) |
| Qwen3.5-35B-A3B | 60-100 t/s | 65+ t/s | 12GB (4-bit) |
| Qwen3.5-122B-A10B | 25-35 t/s | N/A | 48GB+ |
The 35B-A3B is the sweet spot for most users—it fits in 12GB VRAM at 4-bit quantization and runs fast enough for interactive use. The dense 27B is slower but scores higher on nuanced reasoning tasks.
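How does a model with 35 billion total parameters fit a 12GB card? The raw 4-bit weights alone come to ~17.5GB, so the table's figure presumably depends on the runtime keeping only the shared layers and hot experts in VRAM while cold experts sit in system RAM. A back-of-envelope sketch (the 65% resident fraction is our assumption for illustration, not a documented number):

```python
# Naive quantized weight footprint, plus an assumed VRAM-resident fraction
# for MoE models whose cold experts are offloaded to system RAM.

def weight_gb(params_b: float, bits: float) -> float:
    """Size of the quantized weights alone, in GB (no KV cache, no activations)."""
    return params_b * bits / 8  # params in billions -> GB directly

full = weight_gb(35, 4)             # all 35B weights at 4-bit
resident = full * 0.65              # assumed: shared layers + hot experts in VRAM
print(f"full weights: {full:.1f} GB, VRAM-resident (assumed 65%): {resident:.1f} GB")
```

Under that assumption the resident set lands near the table's 12GB figure; a runtime that keeps everything on-GPU would need the full 17.5GB plus cache.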
Nemotron 3 Super: Agentic Throughput Monster
NVIDIA announced Nemotron 3 Super at GTC, and it’s designed specifically for agentic workloads. The architecture is a hybrid Mamba-Transformer MoE—120B total parameters with 12B active.
What makes it stand out:
- 60.47% on SWE-Bench Verified (real-world coding)
- 1M token context window
- 2.2x higher throughput than GPT-OSS-120B
- Multi-token prediction for 3x faster inference
The throughput numbers are particularly impressive. Median output across providers hits 414.6 tokens per second—well above the 81.8 t/s average for similar-sized models. That’s 5x faster than the previous Nemotron Super.
For local deployment, Nemotron 3 Super needs roughly 64GB of memory, whether Apple unified memory or pooled VRAM. That's Mac Studio territory or dual high-end GPUs. Not consumer-friendly yet, but the architecture innovations will filter down.
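The 64GB figure is consistent with quick arithmetic: 120B parameters at 4-bit is 60GB of weights, and the hybrid design keeps the KV cache small because only the Transformer layers, not the Mamba layers, store per-token state. The layer and head counts below are illustrative assumptions, not published architecture details:

```python
# Back-of-envelope check on the 64GB figure: 4-bit weights plus KV cache.
# Attention-layer count, KV heads, and head_dim are assumptions; in a hybrid
# Mamba-Transformer, only the attention layers contribute to the KV cache.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

weights_gb = 120 * 4 / 8  # 120B params at 4-bit -> 60 GB
kv = kv_cache_gb(layers=10, kv_heads=8, head_dim=128, context=32_000)
print(f"weights: {weights_gb:.0f} GB + 32K-token KV cache: {kv:.1f} GB")
```

Even with generous context, weights plus cache stay near the quoted 64GB; a pure-Transformer 120B model with the same settings would need several times the KV budget.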
GLM-4.7-Flash: Coding Benchmark Crusher
Zhipu AI’s GLM-4.7-Flash continues to impress a month after release. The 30B MoE model is optimized specifically for coding tasks, and the benchmarks reflect that focus.
The standout result: 59.2% on SWE-Bench Verified. For comparison, Qwen3-30B scores 22% and GPT-OSS-20B scores 34%. That’s a massive gap for real-world software engineering tasks.
Local Performance
Community testing shows GLM-4.7-Flash hitting:
- 82 t/s on M4 Max MacBook Pro
- 60-80 t/s on RTX 3090/4090
- ~18GB VRAM required (4-bit quantization)
The speed advantage comes from aggressive MoE sparsity. If you’re primarily running coding tasks, GLM-4.7-Flash is currently the best performer in the 24GB VRAM tier.
Gemma 3 QAT: Google’s Quantization Play
Google’s Quantization-Aware Training approach for Gemma 3 deserves attention. The 27B model drops from 54GB (BF16) to just 14.1GB with INT4 quantization while preserving near-original quality.
A single RTX 3090 can now run Gemma 3 27B with room to spare for KV cache. That’s a significant accessibility improvement.
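The reported sizes line up with simple arithmetic: BF16 is 2 bytes per parameter and INT4 is half a byte. The 0.6GB gap over the naive INT4 figure would come from tensors kept at higher precision, such as embeddings, though that breakdown is our assumption:

```python
# Reproducing the Gemma 3 27B size figures from bits-per-parameter alone.
# The difference vs. the reported 14.1 GB is presumably higher-precision
# tensors (e.g. embeddings) -- an assumption, not a published breakdown.

def model_gb(params_b: float, bits: float) -> float:
    """Weight size in GB for params (billions) at a given bit width."""
    return params_b * bits / 8

bf16 = model_gb(27, 16)  # matches the reported 54 GB BF16 size
int4 = model_gb(27, 4)   # naive INT4 size, close to the reported 14.1 GB
print(f"BF16: {bf16:.1f} GB, naive INT4: {int4:.1f} GB")
```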
The catch: Gemma 3 trails the Chinese models on most benchmarks. It’s a solid general-purpose option, but if you care about leaderboard position, Qwen 3.5 and GLM-4.7 are ahead.
The New Local AI Tier List
Based on this week’s benchmarks and community testing:
24GB VRAM (RTX 3090/4090)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-Bench Verified |
| General | Qwen3.5-35B-A3B | 60-100 t/s | Best reasoning per VRAM |
| Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | Dense model, no MoE shortcuts |
| Speed | Mistral Small 4 (Q4) | 40-60 t/s | 256K context, multimodal |
12-16GB VRAM (RTX 4070/4080)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency shines |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | Still dominates SWE-Bench |
| General | Gemma 3 27B QAT | 20-30 t/s | Runs in 14.1GB |
Mac Unified Memory (32GB+)
The M4 Max is competitive with discrete GPUs for quantized models. Qwen3.5-35B-A3B and GLM-4.7-Flash both hit 60+ t/s on Metal, making Apple Silicon a legitimate local AI platform.
What This Means
The gap between open-weight and closed models continues to narrow. Current leaderboards show S-tier open models like GLM-4.7, Kimi K2.5, and MiniMax M2.5 matching or exceeding proprietary performance on specific benchmarks.
For local users, the practical takeaways:
- Qwen 3.5’s MoE models are the new default for general-purpose local inference
- GLM-4.7-Flash owns the coding niche if you have 24GB VRAM
- Nemotron 3 Super’s architecture is the future—watch for distilled versions
- Gemma 3 QAT makes Google’s model viable on consumer hardware
The hardware requirement wall keeps dropping. A year ago, running anything competitive meant datacenter GPUs. Now a single RTX 4090 handles quantized 70B-class inference at interactive speeds. Next year, expect the same for 100B+ models.
What You Can Do
If you have 24GB VRAM: Run GLM-4.7-Flash for coding, Qwen3.5-35B-A3B for everything else. Both hit 60+ t/s and compete with closed models on benchmarks.
If you have 12-16GB VRAM: Qwen3.5-35B-A3B at 4-bit is your best option. The MoE architecture means minimal quality loss from quantization.
If you’re on Mac: M4 Max or better gives you competitive performance. MLX optimizations for Qwen 3.5 are mature.
The open-weight ecosystem just had its strongest month since Llama 2 dropped. Check the home GPU leaderboard for current rankings—they’re updating weekly as new benchmarks arrive.