Open-Weight LLM Showdown: Mistral Small 4 Arrives, DeepSeek V4 Finally Lands

Mistral drops a 119B MoE model under Apache 2.0, DeepSeek V4 emerges from stealth, and dual RTX 5090 setups are matching H100 on 70B inference. This week changed the game.

[Image: Server racks with blue LED lights in a modern data center]

Two major releases landed this week: Mistral Small 4 dropped at GTC with 128 experts under Apache 2.0, and DeepSeek V4 finally emerged from its months-long stealth mode. Meanwhile, dual RTX 5090 benchmarks confirm what enthusiasts suspected—consumer hardware can now match enterprise GPUs on 70B inference.

Here’s what matters.

Mistral Small 4: The New Efficiency King

Announced at GTC on March 16, Mistral Small 4 is a 119B-parameter MoE model that activates only about 6B parameters per token. That's 128 experts with 4 active per token, a design choice that makes it remarkably efficient.
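The 128-experts-4-active design can be sketched in a few lines. This is generic top-k gating as used in MoE layers broadly, not Mistral's actual router (which hasn't been published); only the expert count and k match the Small 4 description above.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, k=4):
    """Pick the top-k experts for one token and renormalize their gates.

    Returns a list of (expert_index, gate_weight) pairs; the gates sum to 1,
    so the token's output is a weighted mix of just k expert outputs.
    """
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in topk])
    return list(zip(topk, gates))

# 128 experts, 4 active per token -- the configuration described above.
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]
for expert, gate in route_token(logits, k=4):
    print(expert, round(gate, 3))
```

The efficiency win is that only 4 of 128 expert FFNs run per token, so compute scales with active parameters while capacity scales with total parameters.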

The headline numbers:

  • 119B total parameters, 6-8B active (8B including embeddings)
  • 256K context window—up from Small 3’s 128K
  • Apache 2.0 license—the most permissive option
  • Multimodal input—text and images

What sets Small 4 apart is configurable reasoning. You can toggle between fast, low-latency responses for simple tasks and deep, reasoning-intensive outputs for complex problems. According to the launch benchmarks, the low-latency mode delivers 40% lower latency and triple the throughput of Small 3.

Performance Comparison

On standard benchmarks, Small 4 positions itself between Small 3.2 and Large 3:

| Benchmark | Small 3.2 | Small 4 | Large 3 |
| --- | --- | --- | --- |
| MMLU | 80.5% | 83.2% | 85.5% |
| HumanEval | 92.9% | 94.1% | 95.8% |
| Arena Hard | 43.1% | 67.4% | 78.2% |
| IFEval | 82.3% | 88.7% | 92.1% |

The Arena Hard jump—from 43% to 67%—represents a significant improvement in real-world conversational ability.

Local Deployment

For local inference, Small 4 is surprisingly runnable on consumer hardware: the MoE design keeps per-token compute low, and at INT4 quantization the weights fit in approximately 40GB of VRAM, achievable on dual RTX 4090s, or on a single 32GB 5090 with a slightly more aggressive quant.

Ollama support is already available: ollama run mistral-small-4
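Beyond the CLI, Ollama exposes a local REST API. The sketch below builds a request for its standard `/api/generate` endpoint; note that the `"reasoning"` entry under `options` is a hypothetical name for the configurable-depth toggle, since Mistral hasn't published the exact parameter, so check the model card before relying on it.

```python
import json

def build_generate_request(prompt: str, reasoning: str = "low") -> dict:
    """Assemble a request body for Ollama's /api/generate endpoint.

    "model", "prompt", "stream", and "options" are standard Ollama fields;
    the "reasoning" key inside options is a hypothetical knob, not a
    documented parameter.
    """
    return {
        "model": "mistral-small-4",
        "prompt": prompt,
        "stream": False,
        "options": {"reasoning": reasoning},  # hypothetical option name
    }

body = json.dumps(build_generate_request("Explain MoE routing in one paragraph."))
# POST `body` to http://localhost:11434/api/generate on a machine
# running Ollama to get a completion back.
print(body)
```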

DeepSeek V4: The Wait Is Over

After missed release windows in mid-February, late February, and early March, DeepSeek V4 finally launched around March 3. The developer community’s reaction has been mixed—enthusiasm about capabilities, skepticism about self-reported benchmarks.

The specs:

  • ~1 trillion total parameters, ~32B active
  • 1 million token context window
  • Native multimodal (text, image, video input; image generation)
  • MIT license

V4 introduces what DeepSeek calls “Manifold-Constrained Hyper-Connections” for training stability at trillion-parameter scale, plus “Engram Conditional Memory” for efficient retrieval over million-token contexts.

Benchmark Claims (Unverified)

Leaked benchmarks suggest V4 is competitive with current frontier models:

  • HumanEval: ~90% (would match Claude Opus 4.6)
  • SWE-bench Verified: 80%+ (top tier for code)
  • MATH: 92.4% (if accurate, best-in-class)

All V4 benchmark claims remain unverified until DeepSeek publishes official reports. The community has been burned before by inflated numbers.

The Practical Reality

V4’s 32B active parameters make it more demanding than V3’s 21B. Even with aggressive quantization, you’re looking at:

  • RTX 5090 (32GB): Tight fit at INT4, limited context
  • Dual 5090 (64GB): Comfortable at INT4, reasonable context
  • Mac Studio M4 Ultra 512GB: Full precision possible

For most local users, V3.2 remains the practical choice. V4 is more relevant for API access or enterprise deployments.
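One way to read the tiers above (my reading, not anything DeepSeek has stated): a full ~1T-parameter model cannot be resident on any single GPU even at INT4, so the consumer-GPU tiers only make sense if inactive experts are offloaded to system RAM, leaving roughly the ~32B active parameters plus KV cache in VRAM. A back-of-envelope check:

```python
def weights_gb(params_billion: float, bits: float) -> float:
    """Weight-only footprint in GB: params * bits_per_param / 8.
    Ignores KV cache, activations, and quantization overhead."""
    return params_billion * bits / 8

# All experts resident at INT4: far beyond any GPU.
print(round(weights_gb(1000, 4)))  # -> 500 (GB)

# Active weights only at INT4: fits consumer cards.
print(round(weights_gb(32, 4)))    # -> 16 (GB)
```

On a 32GB 5090, ~16GB of active weights leaves the remainder for cache and activations, which matches the "tight fit, limited context" description above.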

Dual RTX 5090: Consumer Hardware Hits Enterprise Territory

The most surprising development this week came from dual GPU benchmarks. Two RTX 5090s running Ollama now match H100 performance on 70B models—at a fraction of the cost.

The numbers:

  • DeepSeek-R1 70B: 33 tokens/second at 30K context
  • Llama 3.3 70B: 27 tokens/second (matching H100)
  • Cost comparison: 2× 5090 ($4K MSRP, $10K+ scalped) vs H100 ($30K+)

Important caveat: Ollama doesn’t parallelize inference across GPUs—it just pools VRAM. You won’t see 2× speedup from 2 cards. What you get is the ability to run larger models without spilling to CPU RAM.
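This caveat follows from how single-stream decode works: it is memory-bandwidth-bound, and with a layer split each token still streams every weight exactly once, so the theoretical ceiling is roughly one card's bandwidth divided by the bytes read per token. A rough sketch, using the published 5090 memory bandwidth (~1.79TB/s) and an assumed ~4.5 bits/param quantized footprint; real systems land well below this ceiling:

```python
def decode_ceiling_tps(model_gb: float, bandwidth_gbps: float) -> float:
    """Upper bound on tokens/sec for one decode stream: each token must
    stream the full quantized weights from VRAM once."""
    return bandwidth_gbps / model_gb

# A 70B dense model at ~4.5 bits/param is roughly 39GB of weights --
# too big for one 32GB card, hence the need to pool two.
model_gb = 70 * 4.5 / 8
ceiling = decode_ceiling_tps(model_gb, 1792)  # RTX 5090: ~1792 GB/s
print(round(ceiling))  # roughly 46 tokens/sec upper bound
```

The measured 27-33 tokens/sec sits below this ceiling because the two cards alternate work on the layer split (pipeline bubbles) and kernels add overhead, but the key point stands: two cards buy you capacity, not bandwidth.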

For 110B+ models like Qwen 3.5 full, dual 5090s still struggle. GPU utilization caps at 20%, and inference drops to 7 tokens/second. Enterprise hardware retains its edge at the largest scales.

The Sweet Spots

Based on current benchmarks:

| Setup | Best Model Class | Tokens/sec | Notes |
| --- | --- | --- | --- |
| Single RTX 5090 | 32B dense | 61-65 | Qwen 3.5 32B optimal |
| Single RTX 5090 | 30B MoE | 234 | Qwen 3 MoE screams |
| Dual RTX 5090 | 70B quantized | 27-33 | H100 territory |
| Single RTX 4090 | 27B dense | 35-45 | Still the value king |

Updated Rankings

Combining leaderboard data with this week’s releases:

For Coding

  1. Qwen 3.5 - GPQA Diamond 88.4%, LiveCodeBench leader
  2. Mistral Small 4 - HumanEval 94.1%, configurable depth
  3. DeepSeek V4 - SWE-bench 80%+ (if benchmarks hold)

For Reasoning

  1. Kimi K2.5 - IFEval 94.0%, AIME 96.1%
  2. Qwen 3.5 - Best GPQA Diamond (88.4%)
  3. Llama 4 Scout - 10M context for document reasoning

For Speed/Efficiency

  1. Mistral Small 4 - 40% lower latency than Small 3
  2. Gemma 3 27B - Dense architecture, no MoE overhead
  3. Qwen 3.5 Small - 9B runs everywhere

For Local Deployment

  1. Qwen 3.5-9B - Best quality under 10B
  2. Mistral Small 4 Q4 - Fits single 5090
  3. Gemma 3 27B Q4 - 14GB with QAT

Hardware Recommendations (March 21, 2026)

RTX 5090 Owners

You have options now:

  • Mistral Small 4 Q4 - The new efficiency standard
  • Qwen 3.5-32B full - Best dense model at this scale
  • Llama 4 Scout INT8 - When you need 10M context

RTX 4090 Owners

Still the practical sweet spot:

  • Mistral Small 4 Q4 - Tight but works
  • Gemma 3 27B Q4 - 14GB leaves room for context
  • Qwen 3.5-9B - Quality that rivals 70B from 2024

Dual GPU Enthusiasts

If you can acquire two 5090s:

  • DeepSeek-R1 70B - Full reasoning model at 33 tok/s
  • Llama 3.3 70B - H100-matching inference
  • Qwen 3.5-70B - The frontier, locally

Mac Users

Unified memory continues to differentiate:

  • M4 Max 128GB: Llama 4 Scout usable, V4 at reduced context
  • M4 Ultra 512GB: Everything fits, eventually

The Bottom Line

This week marked a shift. Mistral Small 4 proves that Apache-licensed, MoE-based models can compete with proprietary options while running on consumer hardware. DeepSeek V4’s arrival—despite benchmark skepticism—adds another trillion-parameter option to the open-weight ecosystem.

The dual 5090 benchmarks are perhaps most significant. Consumer hardware matching H100 performance on 70B models wasn’t expected this soon. Yes, you still can’t buy a 5090 at MSRP. But the performance ceiling for home labs keeps rising.

For most users, the practical action remains unchanged: Qwen 3.5-9B via Ollama handles 90% of tasks. When you need more, Mistral Small 4 and Gemma 3 27B offer excellent quality-to-resource ratios.

What to Try This Week

  1. Mistral Small 4 - ollama run mistral-small-4 and test configurable reasoning

  2. DeepSeek V4 via API - If self-hosting is impractical, try the hosted version first

  3. Dual GPU owners - Benchmark DeepSeek-R1 70B at extended context

  4. Everyone else - Qwen 3.5-9B remains the default recommendation

Next week: We’ll see if V4’s benchmarks hold under independent testing, and whether the 5090 supply situation improves.