Open-Weight LLM Showdown Week 6: MiMo-V2-Flash Brings 309B Parameters to Consumer GPUs

MiMo-V2-Flash runs 309B parameters on RTX 4090s. GLM-5 sets benchmarks but needs datacenters. Llama 4 Scout stays out of reach.


The open-weight landscape split further this week. Xiaomi’s MiMo-V2-Flash proved you can squeeze 309 billion parameters onto a single RTX 4090. Meanwhile, Zhipu AI’s GLM-5 set new benchmarks that require hardware most of us will never own. And Llama 4 Scout remains the model everyone wants to run locally but can’t.

Here’s what actually changed for consumer hardware users.

MiMo-V2-Flash: The Consumer GPU Breakthrough

Xiaomi released MiMo-V2-Flash under an MIT license with a promise: frontier-class reasoning on consumer hardware. The model packs 309 billion total parameters but activates only 15 billion per token through its MoE architecture.
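The total/active split is the whole story for MoE economics: per-token compute scales with the active set, while the full parameter count still governs weight storage. A back-of-envelope sketch using the published 309B/15B figures (the FLOPs rule of thumb is a standard approximation, not something from Xiaomi's release):

```python
# Rough MoE arithmetic from MiMo-V2-Flash's published parameter counts.
TOTAL_PARAMS = 309e9   # all experts combined (determines storage)
ACTIVE_PARAMS = 15e9   # parameters touched per token (determines compute)

# Fraction of the model doing work on any given forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active per token: {active_fraction:.1%}")

# Per-token FLOPs scale with ~2 * active params (one multiply-add per weight),
# so each forward pass costs roughly what a 15B dense model's would.
dense_equivalent_gflops = 2 * ACTIVE_PARAMS / 1e9
print(f"~{dense_equivalent_gflops:.0f} GFLOPs per token")
```

Under 5% of the model is hot per token, which is why decode speed looks like a 15B model's even though you are storing 309B weights.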

The benchmark numbers that matter:

  • 94.1% on AIME 2025 (mathematical reasoning—within 0.5% of GPT-5 High)
  • 89.2% on GPQA Diamond (scientific reasoning)
  • 150 tokens/second on optimized infrastructure

But those server-side numbers don’t tell the consumer story. Community testing shows what actually happens on real hardware:

RTX 4090 Performance (24GB VRAM)

| Quantization | VRAM Used | Speed | Quality |
|---|---|---|---|
| Q3_K_S | ~22GB | 20-25 t/s | Good for reasoning |
| IQ3_XS | ~20GB | 25-30 t/s | Slight quality drop |
| INT8 | ~24GB | 18-22 t/s | Best quality fit |

These speeds are fast enough for genuinely interactive use. The architecture innovations that make this possible:

  1. Hybrid attention with 128-token sliding windows reduces KV-cache storage by 6x
  2. Multi-Token Prediction generates multiple tokens per forward pass
  3. Aggressive MoE sparsity keeps only 15B parameters hot at any time
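The KV-cache saving from sliding-window attention is easy to see with a toy calculation. The layer count and head dimensions below are illustrative placeholders, not MiMo-V2-Flash's actual config (which isn't fully public); the point is that windowed layers cache at most 128 tokens no matter how long the context grows:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
    """K and V tensors per layer (hence the leading 2), FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem

CTX = 8192       # context length
WINDOW = 128     # sliding-window size from the release notes
LAYERS = 48      # illustrative, not the real config
KV_HEADS, HEAD_DIM = 8, 128  # illustrative

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX)

# Hybrid attention: suppose 1 in 8 layers keeps full attention and the rest
# are windowed (the real interleaving pattern isn't public).
full_layers = LAYERS // 8
hybrid = (kv_cache_bytes(full_layers, KV_HEADS, HEAD_DIM, CTX)
          + kv_cache_bytes(LAYERS - full_layers, KV_HEADS, HEAD_DIM, WINDOW))

print(f"full attention : {full / 2**20:.0f} MiB")
print(f"hybrid         : {hybrid / 2**20:.0f} MiB ({full / hybrid:.1f}x smaller)")
```

With these made-up numbers the hybrid cache comes out around 7x smaller at 8K context, in the same ballpark as the claimed 6x; the exact factor depends on the full-to-windowed layer ratio.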

The catch: context length. MiMo-V2-Flash’s aggressive memory optimizations limit practical context to around 8K tokens on consumer GPUs. For longer conversations or document analysis, you’ll need more VRAM or accept significant slowdowns from CPU offloading.

GLM-5: Frontier Performance, Datacenter Requirements

Zhipu AI’s GLM-5 set the new high-water mark for open models in February. The 744B parameter MoE model (40B active) beats Claude Opus 4.5 on Humanity’s Last Exam and holds the #1 spot on SWE-bench Verified at 77.8%.

Notable: the entire model trained on Huawei Ascend chips using MindSpore. Zero NVIDIA dependency. That’s a technical achievement worth acknowledging regardless of where you stand on geopolitics.

But running GLM-5 locally is a different story:

| Format | Memory Required | Hardware Needed |
|---|---|---|
| FP16/BF16 | ~1.5 TB | 8x H200 minimum |
| 2-bit GGUF | ~300 GB | Single 24GB GPU + 300GB RAM |
| 1-bit | ~180 GB | Impractical speed |

The 2-bit quantization works but requires serious CPU offloading. You can technically run GLM-5 on a 24GB GPU with 300GB of system RAM using MoE offloading, but expect single-digit tokens per second. It’s a proof of concept, not a daily driver.
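Those memory figures fall out of simple bytes-per-parameter arithmetic. The bits-per-weight values for the low-bit rows are approximate effective rates (GGUF quants carry per-block scale overhead, so "2-bit" files land closer to 3 bits per weight in practice); treat them as estimates:

```python
PARAMS = 744e9  # GLM-5 total parameters

def weight_gb(params, bits_per_weight):
    """Weight storage only; KV-cache and activations come on top."""
    return params * bits_per_weight / 8 / 1e9

print(f"FP16 : {weight_gb(PARAMS, 16):.0f} GB")   # ~1488 GB, i.e. ~1.5 TB
print(f"2-bit: {weight_gb(PARAMS, 3.0):.0f} GB")  # assuming ~3 bpw effective
print(f"1-bit: {weight_gb(PARAMS, 1.9):.0f} GB")  # assuming ~1.9 bpw effective
```

Even the most aggressive quantization leaves you far outside any single consumer GPU's VRAM, which is why the only local path is parking expert weights in system RAM.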

This is the growing divide in “open-weight” models: technically available weights that require datacenter hardware to actually run.

Llama 4 Scout: So Close, Yet So Far

Meta’s Llama 4 Scout remains the most discussed model that nobody can run locally. The 17B active parameter MoE (109B total) needs approximately 55GB VRAM at Q4 quantization—more than double what an RTX 4090 provides.
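The ~55GB figure is consistent with Scout's total parameter count at roughly 4 bits per weight. The key trap with MoE models: it is the 109B total, not the 17B active set, that has to be resident in memory:

```python
TOTAL = 109e9   # Llama 4 Scout total parameters (what you must store)
ACTIVE = 17e9   # active per token (governs speed, not storage)

# ~4 bits/weight, ignoring quant-block overhead; real Q4 GGUFs run a bit higher.
q4_gb = TOTAL * 4.0 / 8 / 1e9
print(f"Q4 weights: ~{q4_gb:.1f} GB")  # more than double a 4090's 24GB
```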

The frustration is real because the benchmarks are excellent:

  • 88.8% on ChartQA (image understanding)
  • 94.4% on DocVQA (document parsing)
  • 10M token context window

Some users report success with Unsloth’s 1.78-bit quantization fitting in 24GB VRAM at around 20 tokens/second. But 1.78-bit quantization involves significant quality tradeoffs—you’re not getting the full Scout experience.

The practical situation:

| Hardware | Scout Status |
|---|---|
| RTX 4090 (24GB) | Too large at useful quantizations |
| Mac Studio 128GB | Runs at Q4, competitive speed |
| Dual GPU (48GB) | Works but complex setup |
| 1.78-bit quant | Fits but degraded quality |

For most users, Llama 4 Scout is a “wait for distillation” model. Meta will likely release smaller variants trained on Scout outputs—that’s when consumer hardware users get access.

The Updated Local AI Tier List

Based on this week’s findings and community benchmarks:

24GB VRAM (RTX 3090/4090)

| Best For | Model | Speed | Why |
|---|---|---|---|
| Reasoning | MiMo-V2-Flash (Q3) | 20-30 t/s | 94.1% AIME, fits in VRAM |
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-bench Verified |
| General | Qwen3.5-35B-A3B | 60-100 t/s | Best overall balance |
| Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | Dense model, no MoE shortcuts |

12-16GB VRAM (RTX 4070/4080)

| Best For | Model | Speed | Why |
|---|---|---|---|
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | Still dominates SWE-bench |
| General | Gemma 3 27B QAT | 20-30 t/s | Runs in 14.1GB |

The “Wait List” (Too Large for Consumer GPUs)

| Model | Why Wait |
|---|---|
| Llama 4 Scout | Needs 55GB+ at Q4 |
| GLM-5 | Needs 300GB+ for usable speeds |
| Llama 4 Maverick | Needs 200GB+ |

What This Means

The open-weight ecosystem is bifurcating. On one side: models like MiMo-V2-Flash, Qwen 3.5, and GLM-4.7 that are genuinely usable on consumer hardware. On the other: “open-weight” frontier models that require enterprise infrastructure.

This isn’t necessarily bad. GLM-5’s existence proves open models can compete at the absolute frontier. The research and techniques will filter down to smaller, more accessible models over time.

But it does mean the phrase “open-weight” needs context. Open weights you can download are different from open weights you can run.

For this week’s practical takeaways:

  1. MiMo-V2-Flash is the new reasoning champion for 24GB GPUs—94.1% on AIME at 20+ t/s is remarkable
  2. GLM-4.7-Flash remains the coding king if you primarily work with code
  3. Qwen 3.5’s MoE models stay the general-purpose default for balance of speed and quality
  4. Llama 4 Scout isn’t a consumer model yet—wait for distillations or buy a Mac Studio

What You Can Do

If you have 24GB VRAM and want reasoning: Download MiMo-V2-Flash from Hugging Face, grab the Q3_K_S GGUF, and run it through Ollama or llama.cpp. Expect 20-30 t/s and performance competitive with GPT-5 on math and science reasoning.

If you want the fastest coding assistant: GLM-4.7-Flash at 4-bit still beats everything else on SWE-Bench Verified while hitting 60+ t/s.

If you’re frustrated with Llama 4 Scout: Check Unsloth’s 1.78-bit quantizations if you’re willing to accept quality tradeoffs. Otherwise, wait for Meta’s inevitable smaller variants.

If you have Mac hardware: The M4 Max and Studio remain the best value for local inference. 128GB unified memory runs Scout at full Q4 quality with room for context. The RTX 4090 can’t match that capability ceiling despite faster raw throughput on smaller models.

The home GPU leaderboard updates frequently, so check the current rankings before downloading anything. The open-weight space moves fast, and what was optimal last month may already be eclipsed.