Open-Source AI Wins: NVIDIA Goes All-In on Open Models, Lightricks Ships 4K Video, and Local Inference Matures

GTC 2026's biggest announcements were open-source. Nemotron 3 Super runs locally on RTX PCs, LTX 2.3 generates 4K video with audio, and vLLM hits production grade.


NVIDIA’s GTC 2026 wrapped yesterday, and the biggest announcements weren’t proprietary hardware — they were open models designed to run locally. Lightricks shipped 4K video generation under Apache 2.0. And the inference stack that makes everything work just got meaningfully faster.

Here’s what matters from the last ten days.

Nemotron 3 Super: NVIDIA’s Open Agent Model

NVIDIA announced Nemotron 3 Super at GTC on March 11 — a 120-billion-parameter model with only 12 billion active parameters per forward pass. The mixture-of-experts architecture means it runs on hardware that couldn’t touch a dense model this size.

The architecture pairs sparse expert activation (roughly four experts' worth of capacity for the compute cost of one) with multi-token prediction (emitting several tokens per forward step instead of one). NVIDIA claims 3x faster inference than an equivalent dense model.
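
The routing trick is easy to see in miniature: a learned router scores every expert per token, only the top few actually run, and the rest of the layer's parameters sit idle. A toy numpy sketch (expert counts and dimensions are illustrative, not Nemotron's real configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 32        # total experts in the layer (illustrative only)
TOP_K = 4             # experts that actually run per token
D_MODEL, D_FF = 64, 256

# Each expert is a small two-layer MLP; the router is a single linear projection.
experts = [(rng.standard_normal((D_MODEL, D_FF)) * 0.02,
            rng.standard_normal((D_FF, D_MODEL)) * 0.02)
           for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_forward(x):
    """Route one token vector through its top-k experts only."""
    scores = x @ router                         # one score per expert
    top = np.argsort(scores)[-TOP_K:]           # indices of the k best-scoring experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                    # softmax over just the selected experts
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w_in, w_out = experts[i]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # weighted sum of expert outputs
    return out

token = rng.standard_normal(D_MODEL)
y = moe_forward(token)
print(f"experts touched: {TOP_K}/{N_EXPERTS} "
      f"(active fraction {TOP_K / N_EXPERTS:.3f})")
```

Scale the same ratio up and you get Nemotron's 120B-total / 12B-active split: per-token compute tracks the experts you run, not the experts you store.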

What sets it apart: a 1-million-token context window and specific optimization for agentic workflows. On PinchBench — a new benchmark measuring how well models perform with OpenClaw and similar agent frameworks — Nemotron 3 Super scored 85.6%, the highest among open models.

The model ships under a permissive license with weights available on Hugging Face, OpenRouter, and Perplexity. CodeRabbit, Factory, and Greptile are already integrating it for AI coding agents.

NVIDIA also announced NemoClaw, an open-source framework that runs autonomous AI agents locally on RTX PCs and DGX systems without per-token cloud fees. If you have an RTX GPU and want to experiment with agents, this is the official stack now.

LTX 2.3: Open-Source 4K Video With Audio

Lightricks released LTX 2.3 in early March — a 22-billion-parameter video model that generates native 4K at 50 FPS with synchronized audio in a single pass.

The improvements over 2.2:

  • Sharper visual detail: More consistent textures, less blur on motion
  • Cleaner audio: Fewer artifacts in generated soundtracks
  • Native portrait mode: 1080x1920 without cropping or stretching
  • Faster inference: 8 denoising steps on the distilled variant

LTX 2.3 runs under Apache 2.0, meaning commercial use and fine-tuning are permitted. Weights are on Hugging Face in three variants: base dev, FP8 quantized, and distilled. ComfyUI has day-0 support.

Lightricks is also shipping LTX Desktop Beta, a free desktop application that runs the entire pipeline locally on consumer hardware. No cloud, no subscription, no data leaving your machine.

For video generation, LTX 2.3 is now the best open option. Runway Gen-4 and Sora still exist, but they cost money per generation and require uploading your prompts to external servers.

Llama 4: Meta’s Multimodal MoE

Meta’s Llama 4 shipped with two models under permissive licenses: Scout and Maverick. Both use mixture-of-experts and process text, images, and short video natively.

Llama 4 Scout (17B active / 16 experts):

  • Fits on a single H100
  • 10M token context window
  • Beats Gemma 3 and Gemini 2.0 Flash-Lite across standard benchmarks

Llama 4 Maverick (17B active / 128 experts):

  • Matches GPT-4o and Gemini 2.0 Flash on multimodal tasks
  • Comparable to DeepSeek v3 on reasoning/coding at half the active parameters

These are Meta’s first MoE models. The architecture matters because it allows massive total parameter counts while keeping inference costs reasonable — only a subset of experts activates per token.

Both models are on Hugging Face now. If you’ve been waiting for a multimodal open model that doesn’t require renting a data center, Scout is the answer.

Qwen 3.5 9B: Small Model, Big Results

Alibaba’s Qwen 3.5 9B continues to impress two weeks after release.

The model uses a Gated DeltaNet hybrid architecture: linear attention combined with a sparse mixture-of-experts. Linear attention keeps a fixed-size recurrent state instead of a key-value cache that grows with sequence length, letting the model allocate more parameters toward actual reasoning.
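
The memory claim is the interesting part: standard attention caches every past key and value, so state grows with the sequence, while linear attention folds all history into one fixed-size matrix. A minimal sketch of that recurrence (plain linear attention with a ReLU feature map; Qwen's actual Gated DeltaNet adds gating and a delta-rule update on top):

```python
import numpy as np

D = 16                      # head dimension (illustrative)
rng = np.random.default_rng(1)

state = np.zeros((D, D))    # fixed-size state: accumulated key-value outer products
norm = np.zeros(D)          # running key sum, used to normalize the readout

def linear_attn_step(q, k, v):
    """One decoding step. The state stays D x D forever, unlike a KV cache,
    which grows by one (k, v) pair per token."""
    global state, norm
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)  # simple positive feature map
    state += np.outer(phi_k, v)
    norm += phi_k
    return (phi_q @ state) / (phi_q @ norm + 1e-9)

outputs = [linear_attn_step(*rng.standard_normal((3, D))) for _ in range(10_000)]
print(state.shape)   # still (16, 16) after 10,000 tokens
```

That constant-size state is what makes long contexts cheap: memory is O(D²) per head regardless of how many tokens have been processed.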

The benchmark results that matter:

  • GPQA Diamond: 81.7% (OpenAI’s GPT-OSS-120B scored 80.1%)
  • Graduate-level reasoning: Beats models 13x its size

This is a 9B model outperforming 120B models on PhD-level science questions. The architecture efficiency gains are real.

The practical version: Qwen 3.5 9B runs on an RTX 4060 or M4 MacBook Air with reasonable context. Frontier reasoning performance on consumer hardware.
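
The hardware claim is mostly weights arithmetic. A back-of-the-envelope sketch (weight memory only; KV cache and runtime overhead add more on top):

```python
PARAMS = 9e9   # Qwen 3.5 9B parameter count
GIB = 1024**3

# Weight memory alone at common precisions (KV cache and activations excluded).
for name, bytes_per_param in [("FP16/BF16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    print(f"{name:9s} ~{PARAMS * bytes_per_param / GIB:.1f} GiB")
```

At 4-bit quantization the weights come to roughly 4.2 GiB, which is why an 8 GB RTX 4060 is enough; full FP16 weights (~16.8 GiB) would not fit.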

vLLM v0.17.1: Production Inference

vLLM hit v0.17.1 on March 11 with the features production deployments need:

  • FP8 inference: Significant throughput improvement on H100 and Blackwell GPUs
  • Continuous batching by default: No more configuration for basic efficiency
  • Streaming support: Real-time token streaming out of the box
  • Multimodal support: LLaVA and Qwen2.5-VL built in
  • Structured outputs: Guided decoding for agent/tool-calling workloads
  • Speculative decoding: Faster generation via draft models
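
Speculative decoding, the last item above, is conceptually simple: a cheap draft model proposes several tokens, the target model verifies them all in one batched pass, and generation keeps the longest agreeing prefix. A stripped-down sketch with stand-in models (greedy acceptance for clarity; vLLM's real implementation uses rejection sampling over probability distributions):

```python
def draft_next(tok):
    """Stand-in cheap model: guesses that tokens simply count upward."""
    return tok + 1

def target_next(tok):
    """Stand-in big model: mostly counts upward, but wraps around after 9."""
    return 0 if tok == 9 else tok + 1

def speculative_step(prefix, k=4):
    """Draft k tokens, verify them against the target in one pass,
    keep the agreeing prefix plus one corrected token."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposals, tok = [], prefix[-1]
    for _ in range(k):
        tok = draft_next(tok)
        proposals.append(tok)
    # 2. Target model scores all k positions in a single batched pass
    #    (expensive, but one pass instead of k sequential passes).
    targets = [target_next(t) for t in [prefix[-1]] + proposals[:-1]]
    # 3. Accept proposals until the first disagreement, then take the
    #    target model's token at that position.
    accepted = []
    for prop, tgt in zip(proposals, targets):
        if prop == tgt:
            accepted.append(prop)
        else:
            accepted.append(tgt)
            break
    return prefix + accepted

seq = [0]
while len(seq) < 12:
    seq = speculative_step(seq)
print(seq)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
```

When the draft agrees often, each expensive target pass yields several tokens instead of one; the output is identical to what the target model alone would produce.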

The CUDA 13 nightly wheel stack now works out of the box. If you tried vLLM before and bounced off dependency hell, it’s worth another shot.

llama.cpp and Ollama: 35% Faster

NVIDIA’s CES 2026 optimizations are now live in both llama.cpp and Ollama:

  • 35% faster on MoE models in llama.cpp on NVIDIA GPUs
  • 30% faster in Ollama on RTX PCs
  • GPU token sampling: Top-K, Top-P, Min-P, and temperature sampling offloaded to the GPU
  • Concurrent CUDA streams: Parallel model loading on multi-GPU setups
  • MMVQ kernel optimizations: Pre-loading data into registers
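
The sampling work being offloaded is a short filter chain over the logits. A CPU reference for what the temperature, Top-K, and Top-P stages do, in order (numpy stand-in for the CUDA kernels; Min-P is analogous, thresholding on a fraction of the max probability):

```python
import numpy as np

def sample(logits, temperature=0.8, top_k=50, top_p=0.9, rng=None):
    """Reference temperature -> top-k -> top-p sampling chain."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature  # temperature scaling
    # Top-K: keep only the k highest logits.
    if top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-P (nucleus): smallest set of tokens whose cumulative mass reaches p.
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(csum, top_p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()                 # renormalize over the kept tokens
    return int(rng.choice(probs.size, p=mask))

logits = [2.0, 1.0, 0.5, -1.0, -3.0]
token = sample(logits, rng=np.random.default_rng(0))
```

Each stage only narrows the candidate set, which is why the whole chain can run as a few cheap GPU passes over the logits instead of a round-trip to the CPU.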

These are free speed improvements. Update your local stack.

What This Means

Three shifts from the past two weeks:

NVIDIA is betting on open. Nemotron 3 Super under a permissive license, NemoClaw as an open agent framework, optimizations contributed upstream to llama.cpp and Ollama. The company that could lock everyone into proprietary CUDA is choosing to build open infrastructure instead.

Video generation is now local. LTX 2.3 running 4K at 50 FPS on consumer hardware changes who can make AI video. You don’t need a Runway subscription or to upload your content to Sora. The model runs on your GPU.

Small models keep punching up. Qwen 3.5 9B matching or beating 100B+ models on graduate reasoning. Llama 4 Scout fitting on a single GPU while competing with GPT-4o. The “you need a data center for frontier AI” narrative is collapsing.

What You Can Do

Try NemoClaw if you have an RTX GPU and want to experiment with AI agents. NVIDIA is positioning this as the official local agent stack.

Run LTX 2.3 through ComfyUI or the desktop beta. 4K video generation with audio, no cloud required.

Update llama.cpp and Ollama. The 30-35% speed improvements are real and require no configuration changes.

Deploy Qwen 3.5 9B or Llama 4 Scout for local inference that competes with cloud APIs. Both run on hardware you can actually buy.

The frontier isn’t behind paywalls anymore. This week, it shipped open-source.