Google dropped Gemma 4 on April 2, and the benchmarks are genuinely impressive. The 31B model ranks #3 on the Arena leaderboard, the 26B MoE variant punches far above its weight class, and, for the first time in the Gemma line, it all ships under Apache 2.0.
Then people tried to run it. The speed situation is ugly enough to keep Qwen 3.5 on the throne for consumer hardware. Here’s the full picture.
Four Models, One License That Actually Matters
Gemma 4 comes in four sizes, all built from the same technology stack as Gemini 3:
| Model | Parameters | Active Params | Context | Target |
|---|---|---|---|---|
| E2B | 2.3B effective | 2.3B | 128K | Phones, IoT |
| E4B | 4.5B effective | 4.5B | 128K | Laptops, edge |
| 26B-A4B | 26B total (MoE) | 3.8B | 256K | Consumer GPUs |
| 31B | 31B dense | 31B | 256K | Workstations |
The E2B and E4B are "edge" models designed for on-device use. The 26B-A4B is a Mixture-of-Experts model with 128 small experts, activating 8 routed experts plus 1 shared expert per token. The 31B is a straightforward dense model.
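The routing scheme described above (128 experts, top-8 routed plus one always-on shared expert) can be sketched in a few lines. The hidden size and weight shapes here are toy values for illustration, not Gemma 4's real configuration:

```python
import numpy as np

def moe_forward(x, router_w, experts, shared_expert, top_k=8):
    """Toy Mixture-of-Experts forward pass for a single token.

    x            : (d,) token hidden state
    router_w     : (n_experts, d) router projection
    experts      : list of n_experts (d, d) expert weight matrices
    shared_expert: (d, d) weight matrix, always active
    """
    logits = router_w @ x                 # score every expert
    top = np.argsort(logits)[-top_k:]     # keep only the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the chosen k
    out = shared_expert @ x               # the shared expert always fires
    for w, i in zip(weights, top):
        out += w * (experts[i] @ x)       # only k of n_experts ever run
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 128                    # toy dims; 128 experts as in the text
x = rng.standard_normal(d)
router_w = rng.standard_normal((n_experts, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
shared = rng.standard_normal((d, d))
y = moe_forward(x, router_w, experts, shared)
print(y.shape)  # (16,)
```

The point of the design is visible in the loop: the model stores 128 experts' worth of weights but multiplies through only 9 of them per token.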
All four handle multimodal input natively—images, video, variable resolutions—and support over 140 languages.
But the biggest news isn’t the architecture. It’s the license.
Apache 2.0: Why Lawyers Care More Than Engineers
For two years, Google’s Gemma models shipped under a custom license that made enterprise legal teams nervous. The previous Gemma Terms of Use included a “Harmful Use” carve-out requiring legal interpretation before adoption, required developers to enforce Google’s rules across all Gemma-based projects, and let Google update the prohibited-use policy unilaterally.
The result? Many enterprises chose Qwen or Mistral instead—not because those models were better, but because their Apache 2.0 licenses were something lawyers already understood.
Gemma 4 fixes this entirely. No custom clauses. No monthly active user limits. No acceptable-use enforcement. Full freedom for commercial and sovereign AI deployments. It’s the same license Qwen, Mistral, and most of the open-weight ecosystem already use.
For hobbyists running models at home, this changes nothing. For startups building products on open-weight models, it changes everything.
The Benchmarks Are Real
Previous Showdown standings had Qwen 3.5 leading, Llama 4 second, Gemma 3 a distant third. Gemma 4 reshuffles that deck:
| Benchmark | Gemma 4 31B | Qwen 3.5 27B | Llama 4 Scout |
|---|---|---|---|
| MMLU Pro | 85.2% | 78.4% | 74.8% |
| AIME 2026 | 89.2% | 48.7% | 42.1% |
| GPQA Diamond | ~70% | ~55% | ~50% |
| BigBench Extra Hard | 74% | ~40% | ~35% |
The math jump is staggering. Gemma 3 scored 20.8% on competition math. Gemma 4 hits 89.2%. That’s not an incremental improvement—it’s a generational leap.
The 26B-A4B MoE model achieves roughly 97% of the dense 31B’s quality while activating only 3.8B parameters per token. On paper, that’s the best quality-per-compute ratio in the open model space.
On paper.
The Speed Problem Nobody at Google Mentioned
Within 24 hours of release, the community found the catch. And it’s a bad one.
A vLLM bug report filed on April 3 documents the issue: Gemma 4 E4B generates at roughly 9 tokens per second on an RTX 4090 using vLLM. A similarly-sized Llama 3.2 3B? Over 100 tokens per second on the same hardware.
The root cause is architectural. Gemma 4 uses heterogeneous attention head dimensions that FlashAttention's fused kernels don't cover, so serving frameworks disable FlashAttention and fall back to slower Triton attention kernels. The fallback is correct but slow; losing the fused fast path means you're running a sports car in first gear.
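A sketch of the kind of dispatch logic a serving framework runs at model load. The supported-dimension set and function names here are illustrative, not vLLM's actual code:

```python
# Hypothetical backend selection: fused FlashAttention kernels are compiled
# for a fixed menu of head dimensions; anything else takes a slower path.
FLASH_SUPPORTED_HEAD_DIMS = {64, 96, 128, 256}  # illustrative set

def pick_attention_backend(head_dims):
    """head_dims: per-layer attention head sizes for the model."""
    if all(d in FLASH_SUPPORTED_HEAD_DIMS for d in head_dims):
        return "flash_attention"
    return "triton_fallback"  # correct, but markedly slower

# A model with uniform, supported head dims gets the fast path...
print(pick_attention_backend([128] * 32))          # flash_attention
# ...while one oddball head dimension demotes every layer's serving plan.
print(pick_attention_backend([128, 96, 72, 128]))  # triton_fallback
```

Note the asymmetry: a single unsupported dimension is enough to knock the whole model off the fast path, which is why one architectural choice can cost 10x in throughput.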
The 26B-A4B MoE model fares better but still disappoints compared to competitors. Real-world tests show around 11 tokens per second on consumer GPUs, while Qwen 3.5 hits 60+ tokens per second on the same hardware.
The MoE architecture also carries a memory-capacity tax: all 26B parameters must reside in VRAM even though only 3.8B are active per token. You pay the full memory cost of a 26B model without getting proportional speed.
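Napkin math makes the asymmetry concrete. The numbers below assume 4-bit weights and an RTX 3090's 936 GB/s memory bandwidth; they are rough estimates, not measurements:

```python
def weights_vram_gb(total_params_b: float, bits_per_weight: int = 4) -> float:
    """VRAM for the weights alone, ignoring KV cache and runtime overhead."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

def decode_tps_ceiling(active_params_b: float, bandwidth_gbs: float,
                       bits_per_weight: int = 4) -> float:
    """Bandwidth-bound upper limit: each decoded token must stream the
    active expert weights from VRAM at least once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

print(weights_vram_gb(26))                  # 13.0 GB -- all 26B must fit, active or not
print(round(decode_tps_ceiling(3.8, 936)))  # ceiling on a 3.8B active set at 936 GB/s
```

The telling part is the ceiling: streaming only the 3.8B active parameters, bandwidth alone would permit hundreds of tokens per second, yet real-world numbers sit near 11. That gap is kernel overhead, not physics, which is why software fixes could recover most of it.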
What You’ll Actually Need to Run It
Here’s the hardware reality for each model:
| Model | VRAM Minimum | Recommended Hardware | Download Size |
|---|---|---|---|
| E2B | ~2 GB | 8 GB RAM laptop, Raspberry Pi 5 | ~7.2 GB |
| E4B | ~6 GB | Any GPU with 10+ GB, Apple Silicon | ~9.6 GB |
| 26B-A4B | ~20 GB | RTX 3090/4090 (24 GB) | ~18 GB |
| 31B | 24+ GB | RTX 3090/4090, or 48-96 GB for full 256K context | ~20 GB |
The 26B-A4B is the sweet spot for a 24 GB consumer GPU. It fits on an RTX 3090 with room for context. The 31B at full 256K context wants 48-96 GB—think dual GPUs or an M4 Max with 128 GB unified memory.
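The jump from 24 GB to 48-96 GB is mostly KV cache, not weights. A standard sizing formula shows why; the layer, head, and dimension counts below are hypothetical placeholders, not Gemma 4's published configuration:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Hypothetical dense-31B-like config at the full 256K context, fp16 cache.
full_ctx = kv_cache_gb(layers=60, kv_heads=8, head_dim=128, context_len=256_000)
print(round(full_ctx, 1))  # 62.9 GB of cache before you count ~20 GB of weights
```

Even under these placeholder numbers, a full-length fp16 cache dwarfs the weights themselves, which is why the 48-96 GB figure above isn't padding. Grouped-query attention and cache quantization are the levers that pull it back down.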
Getting started with Ollama is dead simple:
```shell
ollama pull gemma4       # pulls E4B by default
ollama pull gemma4:26b   # 26B-A4B MoE
ollama pull gemma4:31b   # 31B dense
```
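Beyond the CLI, Ollama serves a local REST API (its `/api/generate` endpoint on port 11434) that any pulled model answers on. The helper below just assembles the request body; the model tag and prompt are examples:

```python
import json

# Ollama's default local endpoint for one-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Assemble the JSON body Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": stream}

body = build_generate_request(
    "gemma4:26b",  # any tag you've pulled with `ollama pull`
    "Summarize the Apache 2.0 license in one sentence.",
)
print(json.dumps(body, indent=2))
# POST this with your HTTP client of choice, e.g.:
#   requests.post(OLLAMA_URL, json=body, timeout=120).json()["response"]
```

With `stream` set to false, Ollama returns one JSON object whose `response` field holds the full completion; leave streaming on for interactive use.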
Apple Silicon users should check MLX for optimized inference. The M4 Max at 546 GB/s memory bandwidth keeps large models in unified memory without offloading—throughput stays higher than you’d expect.
The Competitive Picture Right Now
Here’s where the open-weight field stands after Gemma 4:
Quality leader: Gemma 4 31B. No contest on reasoning and math. Arena #3 ranking backs it up.
Consumer speed champion: Qwen 3.5 35B-A3B. Still hits 60+ tokens per second on an RTX 3090 where Gemma manages 11. Until the inference speed gets fixed, Qwen owns consumer hardware.
Context window king: Llama 4 Scout. Its 10M token context window remains unmatched. Gemma 4’s 256K is generous but not in the same category.
Multilingual leader: Qwen 3.5. Gemma 4 supports 140+ languages natively, but Qwen’s multilingual performance hasn’t been dethroned yet.
Best edge model: Gemma 4 E2B. Running genuine multimodal intelligence on under 2 GB of memory is remarkable. Nothing else in this size class touches it.
What This Means
Gemma 4 is the most capable open-weight model family released to date. The benchmark gains are real, the Apache 2.0 licensing removes the last barrier to enterprise adoption, and the edge models push on-device AI forward in meaningful ways.
But capability without speed is a lab result, not a daily driver. The inference problem isn't a minor optimization issue; it's a roughly 10x throughput gap against competitors on the same hardware. Until Google or the community lands FlashAttention support for Gemma 4's attention architecture, most consumer GPU users should stick with Qwen 3.5 for everyday work.
The smart move: run Gemma 4 E2B or E4B on edge devices where its multimodal capabilities shine. Use the 26B-A4B or 31B for tasks where quality matters more than speed—complex reasoning, math, document analysis. Keep Qwen 3.5 as your everyday workhorse.
And watch the vLLM and llama.cpp repos closely. The moment FlashAttention support lands for Gemma 4’s architecture, the rankings will shift again. The raw capability is there. It just needs an engine that can keep up.