Last week’s column tracked six labs shipping competitive open-weight models. This week, we gained one and lost one. NVIDIA launched the Nemotron 3 family with a fundamentally different architecture that trades raw benchmark scores for throughput that embarrasses everything else in its class. Meanwhile, Meta—the company that popularized open-weight AI with Llama—shipped Muse Spark as a fully proprietary model, leaving the future of Llama in limbo.
The competitive picture just changed more in one week than in the previous three combined.
NVIDIA Nemotron 3: The Architecture Gamble
Nearly every other open-weight model on the market is a pure Transformer, dense or with Mixture-of-Experts bolted on. NVIDIA went a different direction. The Nemotron 3 family uses a hybrid Mamba-Transformer MoE architecture—interleaving state space model (Mamba-2) layers with traditional Transformer attention layers, then adding MoE routing on top.
Why does this matter for your GPU? Transformer self-attention has a problem: the KV cache grows linearly with context length, and every new token has to attend over everything generated before it. The longer the conversation, the slower and more memory-hungry it gets. Mamba layers instead carry a fixed-size recurrent state, so memory and per-token compute stay constant during generation regardless of how many tokens have already been produced. By replacing most attention layers with Mamba layers while keeping a few Transformer layers at key depths for precise recall tasks, NVIDIA claims 4x improved memory and compute efficiency versus pure Transformers.
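To make the scaling difference concrete, here is a back-of-envelope comparison. The layer counts and dimensions below are illustrative assumptions, not Nemotron 3's actual configuration:

```python
# Back-of-envelope memory comparison: Transformer KV cache vs. Mamba state.
# All dimensions here are illustrative assumptions, not Nemotron 3's config.

def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    """Transformer KV cache: K and V tensors per layer, grows with context."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 2**30

def mamba_state_gib(layers=32, d_inner=4096, d_state=16, bytes_per=2):
    """Mamba SSM state: fixed size per layer, independent of context length."""
    return layers * d_inner * d_state * bytes_per / 2**30

for ctx in (8_192, 131_072, 1_000_000):
    print(f"{ctx:>9,} tokens: KV cache ~{kv_cache_gib(ctx):7.2f} GiB, "
          f"Mamba state ~{mamba_state_gib():.4f} GiB (constant)")
```

At a 1M-token context, this toy Transformer's cache alone passes 100 GiB while the Mamba state sits at a few megabytes. That asymmetry is the whole bet.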
The family ships in three sizes:
| Model | Total Params | Active Params | Context | License | Consumer GPU? |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 31.6B | ~3.6B | 1M | Open | Yes (Q4 fits 24 GB) |
| Nemotron 3 Super | 120.6B | 12.7B | 1M | Open | No (Q4 still ~70 GB) |
| Nemotron 3 Ultra | ~500B | ~50B | 1M | Open | Datacenter only |
The Nano model is the consumer hardware story. At 3.6B active parameters from a 31.6B total pool, it’s a direct competitor to GPT-OSS 20B and Qwen 3.5’s smaller MoE variants. The Q4_K_M GGUF weighs about 19 GB and fits a single RTX 4090 with room left for context.
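If you want to sanity-check whether a given quant fits your card before downloading 19 GB of weights, the arithmetic is simple. A minimal sketch, assuming typical llama.cpp bits-per-weight averages and a rough 10% allowance for runtime overhead:

```python
# Rough VRAM check for a quantized GGUF. Bits-per-weight figures are
# typical llama.cpp averages; the 10% overhead for activations and a
# modest KV cache is a rough assumption.
BPW = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}

def vram_needed_gb(total_params_b, quant, overhead=1.10):
    """Params in billions * bits per weight / 8 = weight size in GB."""
    return total_params_b * BPW[quant] / 8 * overhead

for quant in ("Q3_K_M", "Q4_K_M", "Q8_0"):
    need = vram_needed_gb(31.6, quant)
    print(f"Nemotron 3 Nano {quant}: ~{need:.1f} GB, fits 24 GB: {need <= 24}")
```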
The Throughput Story
Here’s where it gets interesting. NVIDIA designed these models for speed, and the numbers reflect it.
Nemotron 3 Nano provides 3.3x higher throughput than Qwen3-30B-A3B and 2.2x higher than GPT-OSS-20B on an H200 with 8K input and 16K output. The Super model pushes even harder—2.2x faster than GPT-OSS-120B and 7.5x faster than Qwen3.5-122B on the same benchmark.
Those numbers are on datacenter hardware. For consumer GPU users running Nemotron 3 Nano locally, the practical speed advantage over GPT-OSS 20B and similar models is real but smaller—the Mamba architecture’s constant-memory generation matters most during long conversations and large context windows, exactly the scenarios where other models slow down.
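Until independent consumer benchmarks land, the cheapest way to ground these claims is to measure on your own card. Ollama's REST API reports token counts and generation timings in its final response, so a decode-throughput check takes a dozen lines (the model tag here is an assumption; substitute whatever you have pulled locally):

```python
# Rough decode-throughput check against a local Ollama server
# (default address http://localhost:11434). The final /api/generate
# response includes eval_count (output tokens generated) and
# eval_duration (time spent generating them, in nanoseconds).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nemotron-3-nano",  # assumed tag; use any pulled model
        "prompt": "Explain the tradeoffs of hybrid Mamba-Transformer models.",
        "stream": False,
    },
    timeout=600,
).json()

tok_per_s = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"Decode throughput: {tok_per_s:.1f} tok/s "
      f"({resp['eval_count']} tokens generated)")
```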
On quality, it’s a mixed bag. Nemotron 3 Nano scores 78.3% on MMLU-Pro versus Qwen3-30B’s 80.9%, so it slightly loses on general knowledge. But it takes the lead on coding with 68.3% on LiveCodeBench v6 versus Qwen3’s 66.0% and GPT-OSS’s 61.0%, and dominates Arena-Hard-v2 at 67.7% versus 57.8% for Qwen3-30B.
The tradeoff is clear: slightly less general knowledge, notably better coding and conversational quality, and dramatically better throughput. For agentic workflows and coding tasks—which is how most local AI enthusiasts actually use these models—Nemotron 3 Nano is a strong new contender.
Meta Walks Away From Open Source
On April 8, Meta debuted Muse Spark, the first model from Meta Superintelligence Labs. It’s natively multimodal, supports tool use, and offers three modes: Instant, Thinking, and Contemplating (which orchestrates multiple reasoning agents in parallel).
What it isn’t: open.
Unlike Llama, which shipped with weights available for download, Muse Spark is entirely proprietary. Meta is offering it through a private preview API to select partners. The strongest commitment on record is a “hope to open-source future versions”—language so noncommittal it could have been generated by a model set to temperature 2.0.
When VentureBeat asked about Llama’s future, Meta only confirmed that “existing Llama models would remain available.” Whether the Llama family will continue to be developed was left unanswered.
For this column, the practical impact is limited. Llama 4 Scout and Maverick still exist and still work. But Meta was the company that forced the entire industry toward open weights by proving you could build a competitive model and give it away. If they’re now reversing course, the open-weight movement loses its most visible advocate—even as it gains new participants like NVIDIA and OpenAI.
Updated Rankings: What Runs on Your Hardware
The table from last week, updated with Nemotron 3:
| Model | Total Params | Active Params | Context | License | Speed (RTX 4090, Q4) |
|---|---|---|---|---|---|
| Nemotron 3 Nano | 31.6B | ~3.6B | 1M | Open | ~50+ tok/s* |
| Gemma 4 31B | 31B | 31B | 256K | Apache 2.0 | ~30 tok/s |
| Gemma 4 26B-A4B | 26B | 3.8B | 256K | Apache 2.0 | ~64 tok/s |
| Qwen 3.5 27B | 27B | ~27B | 128K | Apache 2.0 | ~35 tok/s |
| Llama 4 Scout | 109B | 17B | 10M | Community | ~20 tok/s |
| Mistral Small 4 | 119B | 8B | 256K | Apache 2.0 | ~28 tok/s |
| GPT-OSS 120B | 117B | 5.1B | 131K | Apache 2.0 | ~30 tok/s |
| GPT-OSS 20B | 21B | 3.6B | 131K | Apache 2.0 | ~45 tok/s |
*Estimated from throughput multiplier claims; independent consumer benchmarks pending.
The Mamba-Transformer hybrid architecture gives Nemotron 3 Nano a structural advantage in long-context scenarios. Where a pure Transformer model slows down as context grows, the Mamba layers maintain constant memory and computation. If you regularly work with large codebases or extended conversations, this matters more than a 2.6-point MMLU-Pro gap.
Follow-Up: What We Were Watching
Last week’s column flagged three things to watch. Here’s where they stand:
Qwen 3.6 Plus open weights: Still not released. The model remains API-only through OpenRouter. Alibaba has said it will release “selected Qwen 3.6 models in developer-friendly sizes,” but no weights have appeared on Hugging Face. For now, the 1-million-token context window and 3x speed claims remain inaccessible to local users.
Independent GLM-5.1 benchmarks: Mostly still unverified. As of April 11, the SWE-Bench Pro and long-horizon task results come exclusively from Z.ai’s internal testing. Community reproductions on Reddit have partially corroborated some numbers, and Z.ai’s prior GLM-5 scores held up under third-party testing, which is encouraging. But no independent lab has published a full evaluation yet.
The vLLM FlashAttention fix for Gemma 4: Still unfixed. Gemma 4’s heterogeneous attention head dimensions continue to force a Triton fallback in vLLM, keeping enterprise serving speeds well below what the model is capable of. vLLM has added full Gemma 4 architecture support and recommends its vllm-openai:gemma4 Docker image as a stopgap, but the underlying performance gap persists; the llama.cpp KV cache fix remains the best option for local users.
What This Means
Three shifts define this week:
1. Architecture diversity arrived. For two years, the open-weight field was homogeneous—Transformers with varying numbers of experts. Nemotron 3’s Mamba-Transformer hybrid is the first competitive alternative architecture to reach consumer hardware. If NVIDIA’s throughput claims hold up under independent testing, it could push other labs to experiment beyond pure Transformers.
2. Open source lost its biggest champion. Meta didn’t just go proprietary with Muse Spark—they left Llama’s future deliberately ambiguous. The silver lining: NVIDIA, OpenAI, Google, Alibaba, Mistral, and Zhipu AI all still ship open weights. The movement is bigger than any single company. But Meta’s shift signals that the “open by default” era of frontier AI may have peaked.
3. The 1M context window is becoming standard. Nemotron 3’s entire family supports 1M tokens. GPT-5.4 has 1M tokens. Qwen 3.6 Plus has 1M tokens. Gemma 4’s 256K now looks conservative. For local users, this means the Mamba architecture’s constant-memory generation advantage will become increasingly relevant as models are expected to handle longer inputs.
My Picks This Week
Best all-around for 24 GB GPU: Still Gemma 4 31B. The KV cache fix makes it genuinely usable at 30+ tok/s, and no model at this quality tier has a more permissive license. If coding is your primary use case, Nemotron 3 Nano deserves a serious look.
Best lightweight option (16 GB GPU): Nemotron 3 Nano takes this from GPT-OSS 20B. Identical active parameter counts (3.6B each), but the Mamba architecture should deliver meaningfully better sustained performance during long sessions. One caveat: the Q4 weights run about 19 GB, so a 16 GB card will need a Q3 quant and a tight context budget. Available via Ollama with `ollama run nemotron-3-nano`.
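For scripting rather than interactive use, the official ollama Python client (pip install ollama) talks to the same local server. A minimal sketch, reusing the tag from the command above:

```python
# Minimal chat call through the official `ollama` Python client.
# Assumes the local Ollama server is running and the "nemotron-3-nano"
# tag from the CLI command above has already been pulled.
import ollama

response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(response["message"]["content"])
```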
Best for agentic/coding workflows: Nemotron 3 Nano. The combination of 68.3% on LiveCodeBench, 1M token context, and the throughput advantage makes it the first model that’s genuinely optimized for the way coding agents work—lots of context, lots of tool calls, sustained generation.
Best daily driver: Qwen 3.5 27B. Still the fastest dense model at its quality level. If you don’t need the 1M context window or the coding edge, the 35 tok/s on an RTX 4090 remains hard to beat for general conversation.
What to Watch Next Week
Nemotron 3 Super independent benchmarks. NVIDIA’s throughput claims for Super are aggressive—2.2x faster than GPT-OSS-120B at comparable quality. Community testing on consumer hardware will either confirm a new tier leader or expose gaps between marketing and reality.
Nemotron 3 Ultra release. The ~500B model with 50B active parameters hasn’t shipped yet. If it lands with weights available, it becomes one of the largest open-weight models you can actually download (GLM-5.1 is technically larger and already MIT-licensed; Ultra’s own licensing terms haven’t been confirmed).
Meta’s next move. The Muse Spark announcement left Llama’s future intentionally vague. The developer community is watching for any signal—will Llama 4 get further updates, or has Meta’s open-weight investment peaked? The answer matters for every project built on Llama infrastructure.
The scorecard: six labs still have skin in the open-weight game, one just went proprietary, and a new architecture arrived that could change how we think about model efficiency on consumer hardware. This column started tracking six contenders; we gained one, lost one, and the competition is only accelerating.