Open-Weight LLM Showdown: RTX 5090 Finally Delivers, But You Can't Buy One

Two weeks after our last roundup, the 5090 benchmarks are in and Qwen 3.5 Small models are running on phones. Here's the real performance picture.

Two weeks ago, we covered the frontier open-weight models you can’t actually run at home. Now the RTX 5090 benchmarks are in, Llama 4 Scout is getting optimized for local inference, and Qwen 3.5 Small has had time to prove itself on real hardware.

Here’s what’s changed.

The RTX 5090 Picture

The card everyone’s been waiting for delivers exactly what the specs promised - when you can find one.

According to RunPod’s comprehensive testing, the RTX 5090 achieves 5,841 tokens/second at batch size 8 with 1024 context, outperforming the A100 by 2.6x. It’s the only consumer GPU to hit sub-100ms time-to-first-token, with the 4090 lagging about 33% behind.

The generational improvement breaks down like this:

  • Overall AI workloads: 60-80% faster than RTX 4090
  • NLP tasks: 72% improvement
  • Computer vision: 44% improvement
  • Average inference: 213 tokens/second vs 127 tokens/second

The 32GB GDDR7 opens up models that choke on 24GB. You can now run quantized 70B models with reasonable context lengths. Llama 3.3 70B hits 15-20 tokens/second with quantization, up from the 4090’s 8-12 tokens/second.
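Numbers like these are easy to sanity-check on your own card. Here is a minimal sketch against a local Ollama server - it assumes Ollama’s default endpoint on localhost:11434 and a model you already have pulled (the qwen3.5:9b tag from later in this piece is used as a stand-in), and reads the decode stats Ollama attaches to the final streamed chunk:

```python
import json
import time

import requests

# Rough local benchmark against a local Ollama server: time-to-first-token
# plus decode throughput, read from the stats on the final streamed chunk.
# Endpoint and model tag are assumptions -- adjust for your setup.
URL = "http://localhost:11434/api/generate"
MODEL = "qwen3.5:9b"  # any model you have pulled locally
PROMPT = "Summarize the tradeoffs of mixture-of-experts models in 200 words."

start = time.perf_counter()
first_token_at = None

with requests.post(URL, json={"model": MODEL, "prompt": PROMPT, "stream": True},
                   stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if first_token_at is None and chunk.get("response"):
            first_token_at = time.perf_counter()
        if chunk.get("done"):
            # eval_count / eval_duration (nanoseconds) cover the decode phase only.
            tps = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
            print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
            print(f"decode throughput:   {tps:.1f} tokens/s")
```

Run it a few times and discard the first pass, which includes model load time.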

The Catch

Good luck buying one. The January 30 launch sold out globally in five minutes. As of mid-March, flagship cards hover around $5,000 against the original $1,999 MSRP, and supply forecasts suggest shortages will persist through mid-2026.

The practical reality: if you have a 4090, you’re fine. It handles 95% of use cases. The 5090 is aspirational hardware for most local AI users right now.

Qwen 3.5 Small: Two Weeks Later

Alibaba’s Qwen 3.5 Small series has been available for two weeks, and real-world testing confirms the hype.

The 9B model runs comfortably on:

  • RTX 4060 with room for context
  • M4 MacBook Air
  • iPhones in airplane mode (2B variant)

The scores that matter:

  • MMMU-Pro visual reasoning: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)
  • HumanEval coding: 76.0 (highest for any sub-8B model)
  • Multilingual: 201 languages via a 250K-token vocabulary

The 2B model on a phone processes text and images entirely on-device, with no network connection required. No cloud, no API calls, no data leaving your device.

Ollama downloads confirm adoption. Qwen 3.5 sits at 1.9 million pulls, making it one of the most downloaded models on the platform.

The Llama 4 Local Inference Story

Llama 4 Scout (17B active parameters from 109B MoE) is getting serious optimization attention. NVIDIA’s TensorRT-LLM pushes it to 40K+ tokens/second on Blackwell B200.

For consumer hardware, the picture is more modest:

  • RTX 5090: ~150 tokens/second with INT4 quantization
  • RTX 4090: ~90-100 tokens/second
  • M4 Max 128GB: ~40 tokens/second

Scout’s 10 million token context window is its killer feature - no other open model comes close. But context at that scale eats memory. Budget at least 64GB of system or unified memory for reasonable performance at long contexts.
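To see why, a back-of-the-envelope KV-cache estimate is enough. The sketch below uses placeholder architecture numbers (layer count, KV heads, and head dimension are illustrative assumptions, not Scout’s published config) purely to show how the cache grows linearly with context:

```python
def kv_cache_gib(context_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size: two tensors (K and V) per layer, each
    shaped [context_len, n_kv_heads, head_dim], at the given precision."""
    total = 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem
    return total / 1024**3

# Placeholder architecture numbers for illustration only -- not Scout's
# published config. The point is the linear growth with context length.
for ctx in (32_768, 131_072, 524_288):
    gib = kv_cache_gib(ctx, n_layers=48, n_kv_heads=8, head_dim=128)
    print(f"{ctx:>9,} tokens -> ~{gib:.0f} GiB of KV cache at fp16")
```

Even with these made-up numbers, the cache alone overtakes any consumer card long before the 10M ceiling, which is why 64GB is a floor rather than a target and why serious long-context use usually leans on KV-cache quantization or offloading.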

Maverick (400B MoE) remains impractical locally. Even aggressive quantization demands enterprise hardware.

The Head-to-Head: What Actually Wins

According to the self-hosted LLM leaderboard updated March 12, here’s how the major contenders stack up for practical deployment:

Task             Winner           Notes
Reasoning        Llama 4 Scout    109B knowledge capacity pays off
Math             Qwen 3.5         AIME: 48.7 vs Scout’s 42.1
Coding           Qwen 3.5         Clear margin on LiveCodeBench, SWE-bench
Multilingual     Qwen 3.5         201 languages, CJK dominance
Speed (dense)    Gemma 3 27B      35-40% faster than Scout
Context length   Llama 4 Scout    10M tokens, nothing else close

For most deployments, Qwen 3.5 is the default pick: across the leaderboard’s eight categories it wins or ties on five, it carries the most permissive license (Apache 2.0), and it spans model sizes from 0.8B to 397B.

Gemma 3: The Speed Pick

Gemma 3 27B generates 33 tokens/second in LM Studio on Mac Studio M3 Ultra, 24 tokens/second in Ollama. Dense architecture means no MoE overhead - what you load is what runs.

Google’s QAT (Quantization-Aware Training) variants drop the model from 54GB to 14GB at INT4 with minimal quality loss. That’s a 4090-friendly size with near-original performance.
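The arithmetic behind those figures is just parameter count times bits per weight. A quick sketch, ignoring the small overheads real quantization formats add back:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint: parameters x bits / 8. Ignores
    per-tensor scales and embeddings kept at higher precision."""
    return params_billion * bits_per_weight / 8

for label, bits in (("BF16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"Gemma 3 27B @ {label:>4}: ~{weight_gb(27, bits):.1f} GB")
print(f"Qwen 3.5 9B  @ INT4: ~{weight_gb(9, 4):.1f} GB")
```

Weights are only part of the budget - KV cache and activations come on top - which is why quoted footprints always run a little larger in practice and why a 9B at Q4 is a tight fit on an 8GB card rather than a comfortable one.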

For interactive chat where response latency matters more than benchmark scores, Gemma 3 remains hard to beat.

DeepSeek V4: Still Waiting

We covered DeepSeek V3.2’s competition wins last time. The V4 announcement has been imminent since late February.

Current status: still not launched. The mid-February, late-February, and early-March windows have all passed. Some community members are calling a recent website update “V4 Lite,” but DeepSeek hasn’t confirmed anything.

When it drops - reportedly a trillion-parameter, natively multimodal model under an MIT license - it will reshape these rankings. Until then, V3.2 remains the best DeepSeek option.

Hardware Recommendations (Updated)

RTX 5090 (32GB)

If you somehow acquired one:

  • Llama 4 Scout at INT8 precision
  • Qwen 3.5 full 397B with aggressive quantization
  • MiniMax M2.5 with breathing room
  • Multiple 70B models for A/B testing

RTX 4090 (24GB)

Still the practical sweet spot:

  • Qwen 3.5-9B (new standard for quality-to-size)
  • Gemma 3 27B Q4 (speed king)
  • Llama 4 Scout Q4 (long context needs)
  • Mistral 3 14B (solid all-rounder)

8-16GB Cards (4060/4070)

The efficiency tier has never been better:

  • Qwen 3.5-4B (runs anywhere)
  • Qwen 3.5-9B Q4 (tight fit but works)
  • Gemma 3 9B (fast inference)

Apple Silicon M4

The unified memory advantage plays out differently now:

  • M4 Max 48GB: runs quantized 70B models more slowly than a discrete GPU, but with more memory headroom
  • M4 Max 128GB: Llama 4 Scout at usable speeds
  • M4 Ultra 512GB: Everything fits, eventually

The Bottom Line

The 5090 delivers generational improvements that matter - when you can buy one. Until supply normalizes in mid-2026, the 4090 remains the practical champion.

Meanwhile, the sub-10B tier has gotten good enough that many tasks don’t need more. Qwen 3.5-9B on a 4060 outperforms what frontier models delivered two years ago.

The gap between local and cloud AI isn’t closing - it’s fragmenting. For simple tasks, local has won. For frontier capabilities, cloud still leads. The middle ground keeps expanding in both directions.

What to Try This Week

  1. Qwen 3.5-9B via Ollama. Run ollama run qwen3.5:9b and test it against whatever you were using before; a minimal comparison sketch follows this list.

  2. Gemma 3 27B Q4 if you have 24GB VRAM and want interactive speeds.

  3. Llama 4 Scout if you have context-heavy workflows - documents, codebases, long conversations.

  4. The 2B phone models if you care about privacy. Real AI inference on airplane mode isn’t a gimmick anymore.
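For the first item, a minimal side-by-side sketch against a local Ollama server - default endpoint assumed, and the second model tag is only an example, so swap in whatever you were running before:

```python
import requests

# Side-by-side prompt comparison against a local Ollama server (default
# endpoint assumed). The second tag is just an example -- substitute
# whatever model you were using before.
URL = "http://localhost:11434/api/generate"
MODELS = ["qwen3.5:9b", "llama3.1:8b"]
PROMPT = "Explain the difference between weight quantization and KV-cache quantization."

for model in MODELS:
    r = requests.post(URL, json={"model": model, "prompt": PROMPT, "stream": False},
                      timeout=600)
    r.raise_for_status()
    print(f"--- {model} ---\n{r.json()['response']}\n")
```

Swap in a handful of prompts from your own workload; a generic benchmark prompt tells you much less than your actual use case.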

Next week: the DeepSeek V4 drop (maybe) and whatever Jensen announces at GTC.