Open-Weight LLM Showdown Week 5: Qwen 3.5 Dominates, Nemotron 3 Super Redefines Efficiency

Qwen 3.5's MoE models hit S-tier benchmarks, NVIDIA's Nemotron 3 Super delivers 5x throughput gains, and GLM-4.7-Flash brings frontier coding to consumer GPUs. The open-weight race just accelerated.


The open-weight leaderboard reshuffled this week. Qwen 3.5’s MoE variants are pushing frontier-class performance onto consumer hardware, NVIDIA’s Nemotron 3 Super is delivering 5x throughput improvements over its predecessor, and GLM-4.7-Flash is proving that 30B models can smoke larger competitors on coding benchmarks.

Here’s what actually matters for running models locally.

Qwen 3.5: MoE Done Right

Alibaba’s Qwen team released the full Qwen 3.5 family over the past month, and the benchmarks are hard to ignore. The flagship 397B-A17B model uses sparse Mixture-of-Experts—397 billion total parameters, but only 17 billion active per forward pass.

The numbers that matter:

  • 91.3% on AIME 2026 (mathematical reasoning)
  • 88.4% on GPQA Diamond (the highest among open models)
  • 76.5% on IFBench (beating GPT-5.2’s 75.4%)
  • Apache 2.0 license across the entire family

But the flagship isn’t what’s exciting for local users. The 35B-A3B variant activates only 3 billion parameters per token while maintaining competitive benchmarks. That’s frontier-adjacent performance on an RTX 4060.
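A rough way to see why MoE helps: weight memory scales with total parameters, while per-token compute scales with active parameters. Here's a back-of-envelope sketch (weights only, ignoring KV cache and runtime overhead; the 12GB figure quoted for the 35B-A3B is below the raw 4-bit weight size, which presumably means some expert weights get offloaded or quantized further):

```python
def weight_mem_gb(total_params_b: float, bits: int) -> float:
    """Weights-only memory for a model quantized to `bits` per parameter."""
    return total_params_b * bits / 8  # billions of params * bytes/param = GB

def per_token_gflops(active_params_b: float) -> float:
    """Rough decode compute: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_b

# Qwen3.5-35B-A3B: 35B total weights, but only 3B active per token
print(f"weights @ 4-bit: {weight_mem_gb(35, 4):.1f} GB")    # 17.5 GB
print(f"compute/token:  {per_token_gflops(3):.0f} GFLOPs")  # vs 54 for a dense 27B
```

The asymmetry is the whole MoE pitch: you pay for 35B parameters in memory but only 3B in compute, which is why decode speed looks like a small model's.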

Consumer Hardware Performance

Model              | RTX 4090 (Q4) | Mac M4 Max | VRAM Required
Qwen3.5-27B        | ~21 t/s       | ~21 t/s    | 16GB (8-bit)
Qwen3.5-35B-A3B    | 60-100 t/s    | 65+ t/s    | 12GB (4-bit)
Qwen3.5-122B-A10B  | 25-35 t/s     | N/A        | 48GB+

The 35B-A3B is the sweet spot for most users—it fits in 12GB VRAM at 4-bit quantization and runs fast enough for interactive use. The dense 27B is slower but scores higher on nuanced reasoning tasks.

Nemotron 3 Super: Agentic Throughput Monster

NVIDIA announced Nemotron 3 Super at GTC, and it’s designed specifically for agentic workloads. The architecture is a hybrid Mamba-Transformer MoE—120B total parameters with 12B active.

What makes it stand out:

  • 60.47% on SWE-Bench Verified (real-world coding)
  • 1M token context window
  • 2.2x higher throughput than GPT-OSS-120B
  • Multi-token prediction for 3x faster inference
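Multi-token prediction speeds up decoding by emitting several draft tokens per forward pass and keeping the prefix that verification accepts. A toy model of the expected tokens per pass makes the speedup intuition concrete (the geometric-acceptance assumption here is ours, not NVIDIA's published method):

```python
def expected_tokens_per_pass(extra_heads: int, accept_rate: float) -> float:
    """Expected tokens emitted per forward pass.

    The base token is always kept; each extra predicted token survives
    only if all earlier ones did (modeled as independent with rate p),
    so the expectation is a truncated geometric series.
    """
    return sum(accept_rate ** i for i in range(extra_heads + 1))

print(expected_tokens_per_pass(3, 1.0))   # 4.0 with perfect acceptance
print(expected_tokens_per_pass(3, 0.75))  # ~2.73, i.e. a 2-3x decode speedup
```

A "3x faster inference" claim is consistent with a few extra prediction heads and a high acceptance rate; the technique trades a little extra compute per pass for far fewer passes.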

The throughput numbers are particularly impressive. Median output across providers hits 414.6 tokens per second—well above the 81.8 t/s average for similar-sized models. That’s 5x faster than the previous Nemotron Super.

For local deployment, Nemotron 3 Super needs about 64GB of memory, whether unified (Apple Silicon) or pooled VRAM across GPUs. That's Mac Studio territory or dual high-end cards. Not consumer-friendly yet, but the architecture innovations will filter down.
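The 64GB figure is consistent with simple weight math: 120B parameters at 4-bit quantization is 60GB before any KV cache. A sketch (the KV-cache inputs below are hypothetical illustration values, not published architecture numbers; the interesting point is that a Mamba-heavy hybrid needs KV state for only its few attention layers):

```python
def quantized_weights_gb(params_b: float, bits: int) -> float:
    """Weight memory in GB for a model quantized to `bits` per parameter."""
    return params_b * bits / 8

def kv_cache_gb(attn_layers, kv_heads, head_dim, ctx_tokens, bytes_per_elem):
    # K and V per attention layer; Mamba layers keep a small fixed-size
    # state instead of a per-token cache, so they are excluded here
    return 2 * attn_layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

print(quantized_weights_gb(120, 4))  # 60.0 GB of weights at 4-bit
# Hypothetical: 4 attention layers, 4 KV heads, dim 128, FP8 cache, 1M context
print(round(kv_cache_gb(4, 4, 128, 1_000_000, 1), 1))  # 4.1 GB
```

Under those (assumed) numbers, weights plus a 1M-token cache lands right around the stated 64GB, which is why the requirement sits just above the raw weight size.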

GLM-4.7-Flash: Coding Benchmark Crusher

Zhipu AI’s GLM-4.7-Flash continues to impress a month after release. The 30B MoE model is optimized specifically for coding tasks, and the benchmarks reflect that focus.

The standout result: 59.2% on SWE-Bench Verified. For comparison, Qwen3-30B scores 22% and GPT-OSS-20B scores 34%. That’s a massive gap for real-world software engineering tasks.

Local Performance

Community testing shows GLM-4.7-Flash hitting:

  • 82 t/s on M4 Max MacBook Pro
  • 60-80 t/s on RTX 3090/4090
  • ~18GB VRAM required (4-bit quantization)

The speed advantage comes from aggressive MoE sparsity. If you’re primarily running coding tasks, GLM-4.7-Flash is currently the best performer in the 24GB VRAM tier.

Gemma 3 QAT: Google’s Quantization Play

Google’s Quantization-Aware Training approach for Gemma 3 deserves attention. The 27B model drops from 54GB (BF16) to just 14.1GB with INT4 quantization while preserving near-original quality.

A single RTX 3090 can now run Gemma 3 27B with room to spare for KV cache. That’s a significant accessibility improvement.
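The published sizes check out as roughly raw weight math: BF16 is 2 bytes per parameter, INT4 is half a byte, and the extra ~0.6GB over the raw 13.5GB plausibly comes from components kept at higher precision (our assumption, e.g. embeddings). A quick check:

```python
params_b = 27          # Gemma 3 27B
bf16_gb, int4_gb = 54.0, 14.1  # sizes quoted in the article

print(params_b * 2)                 # 54: BF16 at 2 bytes/param, matches
print(params_b * 4 / 8)             # 13.5: raw INT4 weights in GB
print(round(bf16_gb / int4_gb, 1))  # 3.8x compression end to end
```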

The catch: Gemma 3 trails the Chinese models on most benchmarks. It’s a solid general-purpose option, but if you care about leaderboard position, Qwen 3.5 and GLM-4.7 are ahead.

The New Local AI Tier List

Based on this week’s benchmarks and community testing:

24GB VRAM (RTX 3090/4090)

Best For | Model                | Speed      | Why
Coding   | GLM-4.7-Flash        | 60-80 t/s  | 59.2% SWE-Bench Verified
General  | Qwen3.5-35B-A3B      | 60-100 t/s | Best reasoning per VRAM
Quality  | Qwen3.5-27B (8-bit)  | 20-25 t/s  | Dense model, no MoE shortcuts
Speed    | Mistral Small 4 (Q4) | 40-60 t/s  | 256K context, multimodal

12-16GB VRAM (RTX 4070/4080)

Best For | Model                   | Speed     | Why
Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency shines
Coding   | GLM-4.7-Flash (4-bit)   | 45-60 t/s | Still dominates SWE-Bench
General  | Gemma 3 27B QAT         | 20-30 t/s | Runs in 14.1GB
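The tier tables above collapse into a small lookup, which can be a starting point for your own launcher scripts (the data is straight from this article's tables; actual model tags in whatever runner you use will differ):

```python
# (vram_tier_gb, task) -> recommended model, per the tables above
TIER_LIST = {
    (24, "coding"):   "GLM-4.7-Flash",
    (24, "general"):  "Qwen3.5-35B-A3B",
    (24, "quality"):  "Qwen3.5-27B (8-bit)",
    (24, "speed"):    "Mistral Small 4 (Q4)",
    (12, "balanced"): "GLM-4.7-Flash (4-bit)" and "Qwen3.5-35B-A3B (4-bit)",
    (12, "coding"):   "GLM-4.7-Flash (4-bit)",
    (12, "general"):  "Gemma 3 27B QAT",
}

def pick(vram_gb: int, task: str) -> str:
    """Return the best listed model that fits the given VRAM budget."""
    for tier in (24, 12):  # prefer the highest tier the budget allows
        if vram_gb >= tier and (tier, task) in TIER_LIST:
            return TIER_LIST[(tier, task)]
    return "no recommendation for this budget/task"

print(pick(24, "coding"))   # GLM-4.7-Flash
print(pick(16, "general"))  # Gemma 3 27B QAT (falls through to the 12GB tier)
```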

Mac Unified Memory (32GB+)

The M4 Max is competitive with discrete GPUs for quantized models. Qwen3.5-35B-A3B and GLM-4.7-Flash both hit 60+ t/s on Metal, making Apple Silicon a legitimate local AI platform.
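Apple Silicon's numbers make sense through a memory-bandwidth roofline: single-stream decode is bound by how many bytes of weights each token must read. A sketch, taking 546 GB/s as the M4 Max's advertised bandwidth (real throughput lands at or below the roofline):

```python
def roofline_tps(bandwidth_gbs: float, weight_bytes_read_gb: float) -> float:
    """Upper bound on tokens/s when decode is memory-bandwidth bound."""
    return bandwidth_gbs / weight_bytes_read_gb

# Dense Qwen3.5-27B at 8-bit: every token streams all 27GB of weights
print(round(roofline_tps(546, 27), 1))  # ~20.2, close to the ~21 t/s measured

# MoE 35B-A3B at 4-bit: only ~1.5GB of active-expert weights per token,
# so the roofline is far above the 65 t/s actually measured; MoE decode
# is bottlenecked elsewhere (routing, shared layers, framework overhead)
print(round(roofline_tps(546, 1.5)))
```

The dense model sitting right at its roofline is why Mac bandwidth matters so much for local inference, and why MoE models leave headroom for software optimization.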

What This Means

The gap between open-weight and closed models continues to narrow. Current leaderboards show S-tier open models like GLM-4.7, Kimi K2.5, and MiniMax M2.5 matching or exceeding proprietary performance on specific benchmarks.

For local users, the practical takeaways:

  1. Qwen 3.5’s MoE models are the new default for general-purpose local inference
  2. GLM-4.7-Flash owns the coding niche if you have 24GB VRAM
  3. Nemotron 3 Super’s architecture is the future—watch for distilled versions
  4. Gemma 3 QAT makes Google’s model viable on consumer hardware

The hardware requirement wall keeps dropping. A year ago, running anything competitive required H100s. Now an RTX 4090 matches enterprise GPUs on 70B inference. Next year, expect the same for 100B+ models.

What You Can Do

If you have 24GB VRAM: Run GLM-4.7-Flash for coding, Qwen3.5-35B-A3B for everything else. Both hit 60+ t/s and compete with closed models on benchmarks.

If you have 12-16GB VRAM: Qwen3.5-35B-A3B at 4-bit is your best option. The MoE architecture means minimal quality loss from quantization.

If you’re on Mac: M4 Max or better gives you competitive performance. MLX optimizations for Qwen 3.5 are mature.

The open-weight ecosystem just had its strongest month since Llama 2 dropped. Check the home GPU leaderboard for current rankings—they’re updating weekly as new benchmarks arrive.