Open-Weight LLM Showdown: GTC Pivots to Inference, DeepSeek V4 Still MIA

Jensen Huang bets on inference chips, Ollama adds multimodal support, and DeepSeek V4 remains the most anticipated release that hasn't happened yet.


Last week we wondered what Jensen would announce at GTC. The answer: NVIDIA is done talking about training. The next AI boom belongs to inference, and NVIDIA just spent $20 billion to prove it.

Here’s what changed in the open-weight world this week.

GTC 2026: The Inference Pivot

Jensen Huang’s keynote framed GTC 2026 as “an inference keynote, an agent keynote, and an AI-factory keynote.” Training got maybe ten minutes. The rest focused on running models fast enough for agentic workflows.

The headline hardware: Groq 3 LPU, born from NVIDIA’s $20 billion acquisition of the inference startup Groq last December. It’s a fundamentally different architecture - SRAM-based instead of HBM, with 150 TB/s memory bandwidth versus the Rubin GPU’s 22 TB/s.

The numbers: Groq 3 LPU racks paired with Vera Rubin deliver 35x higher tokens per watt than Blackwell alone. Ships Q3 2026.

What This Means for Local AI

Nothing immediate. The Groq 3 LPU is enterprise hardware - 256 chips per rack, data center deployment. But the strategic signal matters: NVIDIA sees inference as the next trillion-dollar opportunity. That means:

  1. Consumer inference hardware will improve. When NVIDIA prioritizes something, the whole supply chain follows.
  2. Cloud inference costs will drop. Competition from Groq architecture pushes API prices down.
  3. Quantization and optimization matter more. The inference focus validates everything the local AI community has been working on.

For now, your 4090 is still the practical choice. But the runway for better consumer inference silicon just got longer.

DeepSeek V4: Three Weeks of “Any Day Now”

We predicted DeepSeek V4 might drop at GTC. It didn’t. What the emerging specs claim:

  • ~1 trillion total parameters with ~32B active per token
  • 1 million token context window (up from V3’s 128K)
  • Native multimodal - image, video, and text generation
  • MIT license expected

The February 27 Financial Times report said “first week of March.” That window closed. A March 9 website update some called “V4 Lite” wasn’t confirmed by DeepSeek.

Unverified internal benchmarks claim 90% HumanEval and 80%+ SWE-bench. Impressive if true. Unverifiable until release.

The best prediction market puts V4 launching in March 2026 at around 65% probability. Ten days left in the month.

When It Drops

If the specs hold, V4 reshapes the local AI conversation. A 32B active parameter model with trillion-parameter knowledge density and million-token context would outperform most current options while remaining runnable on high-end consumer hardware.

The expected MIT license means virtually no restrictions. Fine-tune it, deploy it commercially, build products on it.

Until then, DeepSeek V3.2 remains the best DeepSeek option. Qwen 3.5 and Llama 4 Scout handle most use cases.

Ollama 0.7: Multimodal Goes Mainstream

While waiting for DeepSeek, Ollama shipped something real. Version 0.7 adds multimodal support via a new inference engine:

  • Llama 4 Scout & Maverick with vision
  • Gemma 3 multimodal variants
  • Qwen 2.5 VL for vision-language tasks
  • Mistral Small 3.1 with image understanding

The practical impact: you can now run ollama run llama4:scout and send it images alongside text. Local vision models just got dramatically easier to deploy.
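Beyond the interactive CLI, Ollama’s REST API accepts images as base64 strings in chat messages. Here’s a minimal sketch of building such a request, assuming a local Ollama server on the default port (the llama4:scout tag is taken from this article, not verified against the registry):

```python
import base64
import json
from pathlib import Path

def build_vision_request(model: str, prompt: str, image_path: str) -> dict:
    """Return a JSON-serializable body for Ollama's /api/chat endpoint
    with one image attached to the user message."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]}
        ],
        "stream": False,
    }

# Usage (requires a running Ollama server at localhost:11434):
# import urllib.request
# body = json.dumps(build_vision_request(
#     "llama4:scout", "What is in this image?", "photo.png")).encode()
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat", data=body,
#     headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req))["message"]["content"])
```

The same payload shape works for any of the vision-capable models listed above; only the model tag changes.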

The update also includes:

  • Web search integration
  • Optimized 4-bit quantization (Q4_K_M)
  • Fixes for Qwen 3.5 stability issues (the model was repeating itself due to missing presence penalty)
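If you’re stuck on an older build that still shows the Qwen repetition bug, a presence penalty can be set per request through the API’s options field. A sketch (the qwen3.5 model tag is assumed for illustration; Ollama 0.7’s fix makes this workaround unnecessary):

```python
def build_generate_request(model: str, prompt: str,
                           presence_penalty: float = 1.0) -> dict:
    """Body for Ollama's /api/generate endpoint. "presence_penalty"
    discourages tokens that have already appeared in the output --
    the sampling knob this article says was missing for Qwen 3.5."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"presence_penalty": presence_penalty},
    }
```

Options passed this way apply to that request only, so you can experiment without editing a Modelfile.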

If you haven’t updated: ollama update then ollama run gemma3:27b-it to test vision capabilities.

Mistral Large 3: The 675B MoE Alternative

Mistral quietly pushed Mistral Large 3 to the front of open-weight options. The architecture: 675B total parameters, 41B active during inference.

Performance claims:

  • Comparable to DeepSeek 3.1 670B and Kimi K2 1.2T on standard benchmarks
  • #2 open model on LMArena, #6 overall
  • Leads on MMMLU for general knowledge and reasoning

The MoE architecture means throughput scales well under concurrent load. You’re activating 41B parameters per request, not 675B.
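The back-of-envelope math makes the trade-off concrete. Parameter counts come from this article; the ~4.5 bits/weight figure for Q4_K_M-style quantization is my assumption:

```python
# Rough per-token cost: dense 675B vs MoE activating 41B per token.
# Decoder inference costs roughly 2 FLOPs per parameter used per token.
TOTAL_PARAMS = 675e9   # Mistral Large 3 total parameters
ACTIVE_PARAMS = 41e9   # parameters activated per token

dense_flops = 2 * TOTAL_PARAMS
moe_flops = 2 * ACTIVE_PARAMS
speedup = dense_flops / moe_flops          # ~16.5x less compute per token

# Memory is NOT reduced: every expert must stay resident so the router
# can pick among them. At an assumed ~4.5 bits/weight (Q4_K_M-style):
weights_gb = TOTAL_PARAMS * 4.5 / 8 / 1e9  # ~380 GB of weights
print(f"~{speedup:.1f}x less compute per token, ~{weights_gb:.0f} GB resident")
```

That resident-memory figure is why the pairing below is 8x A100 rather than consumer cards: the MoE saving is compute and bandwidth per token, not total footprint.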

Hardware requirements: 8x A100 is the accessible path for most deployments. Consumer hardware can run smaller Mistral variants - the 3.1 Small at 24B fits comfortably on a 4090.

Updated Recommendations

What Changed This Week

| Model | Update | Impact |
| --- | --- | --- |
| Llama 4 Scout/Maverick | Ollama 0.7 multimodal | Vision tasks now easy locally |
| Gemma 3 | Ollama 0.7 multimodal | Same - vision works out of the box |
| Qwen 3.5 | Stability fixes | Fewer repetition issues |
| Mistral Large 3 | Benchmark results | Strong alternative for API access |

The Current Tier List

Frontier Open-Weight (API or high-end server):

  • Mistral Large 3 (675B MoE) - best open-weight option if you have 8xA100
  • Waiting: DeepSeek V4 (whenever it ships)

High-End Consumer (RTX 5090 / 4090 / M4 Max):

  • Llama 4 Scout - 10M context, now with vision
  • Qwen 3.5-32B - best coding, best math
  • DeepSeek V3.2 - still excellent all-rounder

Mid-Range (16-24GB VRAM):

  • Gemma 3 27B Q4 - fastest inference, now multimodal
  • Qwen 3.5-9B - punches above its weight class
  • Mistral Small 3.1 - solid with vision support

Entry (8-16GB VRAM):

  • Qwen 3.5-4B - runs anywhere, surprisingly capable
  • Gemma 3 9B - fast and multimodal
  • Llama 3.3 8B - reliable baseline

What to Try This Week

  1. Ollama 0.7 multimodal - Update and test vision: ollama run gemma3:27b-it then send an image path.

  2. Llama 4 Scout with images - ollama run llama4:scout for vision + long context combined.

  3. Qwen 3.5-9B if you haven’t yet - stability improvements make it more reliable now.

  4. Watch for DeepSeek V4 - check their GitHub/website. When it drops, it’ll be the story of the month.

Next week: DeepSeek V4 (still maybe) and the first real-world Groq 3 benchmarks from third parties.