Open Source AI Wins: DeepSeek V4 Goes MIT, NVIDIA Ships Hybrid Mamba Models, and Google Solves the Memory Problem

DeepSeek V4 matches Claude Opus on coding at 7x lower cost under MIT license. NVIDIA's Nemotron 3 brings hybrid Mamba-Transformer MoE to the open. Google's TurboQuant cuts KV cache memory by 6x with no retraining.

The week’s biggest open-source story dropped April 24: DeepSeek released V4 under the MIT license, matching Claude Opus 4.6 on coding benchmarks at a fraction of the cost. But the release that might matter more long-term came from Google: a compression algorithm that cuts the memory LLMs need for long contexts by up to 6x, no retraining required. Meanwhile, NVIDIA shipped an entirely new model architecture to the open, and Mistral proved small models can still compete.

Here’s what happened.

DeepSeek V4: 1.6 Trillion Parameters, MIT License, $3.48 per Million Tokens

DeepSeek released preview builds of V4 on April 24 in two sizes: V4-Pro at 1.6 trillion parameters with a one-million-token context window, and V4-Flash at 284 billion parameters. Both ship under the MIT license — download the weights, fine-tune them, deploy them commercially, no restrictions.
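
If this release follows DeepSeek’s previous ones, getting the weights running locally is a few lines of transformers code. A minimal sketch; the repo id and the trust_remote_code flag are assumptions modeled on earlier DeepSeek releases, not confirmed details of this one:

```python
# Hypothetical loading sketch. "deepseek-ai/DeepSeek-V4-Flash" is an assumed
# repo id, modeled on DeepSeek's earlier Hugging Face naming conventions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the dtype stored in the checkpoint
    device_map="auto",       # shard across whatever GPUs are available
    trust_remote_code=True,  # prior DeepSeek releases ship custom model code
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```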

The benchmark numbers are what make this interesting. V4-Pro scores 80.6% on SWE-bench Verified, within 0.2 points of Claude Opus 4.6. On coding specifically, it leads Claude on Terminal-Bench 2.0 (67.9% vs 65.4%), LiveCodeBench (93.5% vs 88.8%), and posts a 3206 Codeforces rating. On MMLU it hits 88.4%.

The architectural innovations matter for anyone thinking about running these models. V4 replaces standard attention with a hybrid of Compressed Sparse Attention and Heavily Compressed Attention. In the million-token context setting, V4-Pro uses only 27% of the inference FLOPs and 10% of the KV cache compared to DeepSeek V3.2. The training used Manifold-Constrained Hyper-Connections for better signal propagation and the Muon optimizer for faster convergence.

The cost story is stark: $3.48 per million output tokens versus Claude’s $25. That’s a 7x gap at near-identical coding performance. The entire model was trained on 16,000 Hopper GPUs for $5.6 million — doubling DeepSeek’s own training efficiency from V3.
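
The arithmetic behind that gap is simple enough to sanity-check. Prices are the ones quoted above; the monthly volume is an illustrative assumption:

```python
# Back-of-envelope cost comparison. Prices per million output tokens are from
# this article; real bills also depend on input tokens and caching discounts.
deepseek = 3.48   # $ / 1M output tokens, DeepSeek V4
claude = 25.00    # $ / 1M output tokens, Claude

monthly_tokens_m = 500  # assumed: 500M output tokens/month for a busy team
print(f"cost ratio: {claude / deepseek:.1f}x")  # ~7.2x
print(f"monthly: ${monthly_tokens_m * deepseek:,.0f}"
      f" vs ${monthly_tokens_m * claude:,.0f}")
```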

There are caveats. Preview builds are preview builds. The benchmarks come from DeepSeek’s own technical report and need independent verification. And the question of whether a model trained in China with potential government data access requirements should run your production code is one each team needs to answer for itself. But the raw capability-per-dollar ratio is undeniable.

Google’s TurboQuant: The Memory Breakthrough Nobody’s Talking About

While model releases grab headlines, Google quietly published something that could change how every model runs. TurboQuant, presented at ICLR 2026, is a compression algorithm that shrinks the KV cache — the memory bottleneck that limits how many tokens a model can process — by 4-6x with virtually no quality loss.

The technical approach is elegant in its simplicity. TurboQuant applies a random orthogonal rotation to each key-value vector, which spreads the energy uniformly across all coordinates. After rotation, each coordinate follows a predictable statistical distribution, so you can compute mathematically optimal quantization buckets ahead of time using the Lloyd-Max algorithm.
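
Here is a minimal Python sketch of that rotate-then-quantize idea; it is not the paper’s implementation. A plain uniform quantizer stands in for the Lloyd-Max-optimal levels, and the rotation is the Q factor of a random Gaussian matrix:

```python
# Minimal sketch of the TurboQuant idea, not the paper's code. The random
# orthogonal rotation spreads each vector's energy evenly across coordinates,
# so one precomputed set of quantization levels fits every coordinate. A
# uniform quantizer stands in here for the paper's Lloyd-Max-optimal levels.
import torch

d = 128                                                  # head dim (example)
g = torch.Generator().manual_seed(0)
Q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))   # random orthogonal

def quantize_dequantize(kv: torch.Tensor, bits: int = 4) -> torch.Tensor:
    rotated = kv @ Q                        # energy now spread across coords
    levels = 2 ** bits
    scale = rotated.abs().max()
    codes = torch.clamp(
        torch.round(rotated / scale * (levels / 2 - 1)),
        -(levels // 2), levels // 2 - 1,
    )                                       # integer codes stored in the cache
    return (codes * scale / (levels / 2 - 1)) @ Q.T  # dequantize, rotate back

kv = torch.randn(1024, d)                   # stand-in for 1024 cached keys
err = (quantize_dequantize(kv) - kv).pow(2).mean() / kv.pow(2).mean()
print(f"relative MSE at 4 bits: {err:.4f}")
```

Because the rotation is orthogonal it preserves dot products, so attention scores computed against dequantized keys drift only as far as the quantization error itself.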

The key property: it needs no calibration data, no fine-tuning, and works on any transformer. At 3.5 bits, TurboQuant matches full-precision performance exactly. At 4 bits, it delivers up to 8x speedup on H100 attention computation versus 32-bit keys.

Community implementations already exist. A from-scratch PyTorch version claims 5x compression at 3 bits with 99.5% attention fidelity, and there’s an active llama.cpp discussion about integration. Once this lands in the inference engines local AI users actually run (llama.cpp, vLLM, Ollama), it could meaningfully expand what hardware can run which models.

For anyone running long-context models locally, this is the paper to watch. A 6x reduction in KV cache memory means a 128K-token context whose cache alone needed 48GB of VRAM might need just 8GB. That’s the difference between needing a workstation and running on a laptop.
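
To make that concrete, here is the standard KV-cache sizing formula with illustrative dimensions (80 layers, 8 grouped-query KV heads, head dimension 128, roughly a 70B-class dense model; these numbers are assumptions, not a specific release):

```python
# Rough KV-cache sizing. Model shape is illustrative, not a named model.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bits: int) -> float:
    # 2x for keys and values; bits/8 bytes per stored element
    return 2 * layers * kv_heads * head_dim * seq_len * (bits / 8) / 1e9

full = kv_cache_gb(80, 8, 128, 128_000, bits=16)       # ~42 GB in fp16
compressed = kv_cache_gb(80, 8, 128, 128_000, bits=3)  # ~8 GB at 3 bits
print(f"fp16 cache: {full:.0f} GB -> 3-bit cache: {compressed:.0f} GB")
```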

NVIDIA Nemotron 3: Hybrid Mamba-Transformer Goes Open

NVIDIA released the Nemotron 3 family in three sizes — Nano (3.6B total, 3.2B active), Super (120B total, 12B active), and Ultra — all using a hybrid Mamba-Transformer MoE architecture with million-token context.

The architecture choice is the story here. Mamba is a state-space model that handles long sequences more efficiently than pure transformers, but historically has been weaker at tasks requiring precise recall. NVIDIA’s hybrid approach combines both: Mamba layers for efficient long-range processing, transformer layers for the tasks that need exact attention. The result, according to NVIDIA, is 5x higher inference throughput compared to Nemotron 2.
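
A schematic of the interleaving idea in PyTorch, not NVIDIA’s actual layout: a depthwise causal convolution stands in for a real Mamba selective-scan layer, and the one-attention-block-per-six ratio is an illustrative assumption:

```python
# Schematic of the hybrid idea, not NVIDIA's layer layout: most blocks are
# state-space layers with linear sequence cost, and a few full-attention
# blocks are interleaved for tasks that need exact recall.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Full self-attention block (the 'transformer' half of the hybrid)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class MambaBlock(nn.Module):
    """Placeholder for a state-space block: a depthwise causal convolution
    stands in for the real selective-scan layer, which is far more capable."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=4,
                             padding=3, groups=d_model)

    def forward(self, x):
        h = self.norm(x).transpose(1, 2)       # (batch, channels, time)
        h = self.mix(h)[..., : x.size(1)]      # trim right pad -> causal
        return x + h.transpose(1, 2)

class HybridStack(nn.Module):
    """One attention block per `attn_every` state-space blocks (assumed)."""
    def __init__(self, n_layers: int = 12, d_model: int = 256,
                 attn_every: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(d_model) if (i + 1) % attn_every == 0
            else MambaBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 64, 256)          # (batch, tokens, d_model)
print(HybridStack()(x).shape)        # torch.Size([2, 64, 256])
```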

Nemotron 3 Nano is particularly interesting for local use. At 3.2 billion active parameters, it achieves better accuracy than Nemotron 2 Nano while activating less than half as many parameters per forward pass as its predecessor did. The models ship with NVFP4 format support on Blackwell hardware and are available on Hugging Face and through multiple inference providers.

NVIDIA also published 10 trillion tokens of training data alongside the models. Open data alongside open weights is still rare enough to be worth noting — it means researchers can study what went into the model, not just what comes out.

Mistral 3: Apache 2.0 From Edge to Enterprise

Mistral shipped its third-generation model family under Apache 2.0, spanning from 3B parameter edge models to a 675B parameter flagship.

The range is what matters. Ministral 3 comes in 3B, 8B, and 14B variants — each with base, instruct, and reasoning versions — designed for laptops, phones, and edge devices. Mistral Large 3 is a 675B-parameter MoE with 41B active parameters and a 256K context window, trained on 3,000 H200 GPUs.
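
The total-versus-active split is the economic argument for MoE: you pay storage for all 675 billion parameters but per-token compute for only the routed 41 billion. A quick check using the article’s figures and the standard rule of thumb of roughly 2 FLOPs per active parameter per generated token:

```python
# MoE compute vs. storage, using the article's Mistral Large 3 figures.
total_params = 675e9    # determines download size and VRAM for the weights
active_params = 41e9    # determines per-token compute

print(f"active fraction: {active_params / total_params:.1%}")  # ~6.1%
print(f"~FLOPs per token: {2 * active_params:.2e}")  # vs 1.35e+12 if dense
```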

Mistral Large 3 debuted at #2 on the LMArena leaderboard for open non-reasoning models. More practically, the Ministral 3 8B instruct variant is small enough to run on a phone and capable enough for most assistant tasks.

Mistral also released Small 4, which unifies reasoning, multimodal, and agentic coding into a single model with configurable reasoning effort. The trend across all vendors this month is clear: one model, multiple capability modes, rather than a different model for each task.

The Pattern

April 2026 is shaping up as the densest month for open-source AI releases in the field’s history. But the pattern is shifting from “look at our benchmark numbers” to “look at our efficiency numbers.”

DeepSeek V4 doesn’t just match Claude on coding; it does so with a tenth of the KV cache of its own predecessor. TurboQuant doesn’t make models smarter; it makes them runnable on hardware you already own. Nemotron 3’s hybrid architecture isn’t about peak performance; it’s about throughput per dollar. Mistral 3’s edge models aren’t frontier-class; they’re phone-class.

The practical takeaway: if you’re running local AI, the hardware requirements just dropped significantly. DeepSeek V4’s sparse attention, TurboQuant’s cache compression, and the general MoE trend mean that the 2024 assumption of “you need a $10,000 GPU for a good model” is increasingly outdated. The models are getting smarter about using less, which matters more for adoption than making them marginally smarter on benchmarks nobody checks in production.

The cost gap between open and closed keeps widening too. When an MIT-licensed model matches a $25/million-token API at $3.48, the value proposition for self-hosting shifts from “ideological preference” to “basic economics.” Not everyone will run their own models — but the ones who do have never had better options.