Small Models, Big Brain: When 4 Billion Parameters Match GPT-4

Modern sub-10B models now rival last year's frontier AI on reasoning, tool use, and code. The benchmarks prove it.

A 4-billion-parameter model just scored 81.3% on the AIME 2025 math competition. On the 2024 edition of the same exam, GPT-4o managed only about 12%. That is not a typo.

Qwen3-4B-Thinking, a model small enough to run on a phone, now solves math problems that stumped the most expensive AI systems on the planet less than two years ago. And it is not alone. A new generation of sub-10-billion-parameter models from Alibaba, Microsoft, Google, Mistral, and others has quietly closed the gap with frontier AI on reasoning, coding, math, and tool use.

The benchmarks tell a story that would have sounded absurd in early 2024: tiny models that match or beat GPT-4-class performance on specific tasks. Here is what the numbers actually show.

The Contenders

Six models define the current small model class. All are open-weight, all run on consumer hardware, and all were released between January 2025 and December 2025.

Qwen3-4B-Thinking-2507 (4B parameters, Alibaba). Released August 2025. A dense transformer with a built-in thinking mode that enables chain-of-thought reasoning before answering. The “2507” revision brought major improvements to instruction following, reasoning, and tool use.

Qwen3-4B-Instruct-2507 (4B parameters, Alibaba). The non-thinking sibling. Faster responses, no extended reasoning chain. Still surprisingly strong on benchmarks.

Phi-4-mini-instruct (3.8B parameters, Microsoft). Released February 2025. Microsoft’s entry in the small model arms race, with dedicated training on math, reasoning, and function calling tasks. A reasoning variant (Phi-4-mini-reasoning) followed in April 2025.

Gemma 3 4B (4.3B parameters, Google). Released March 2025. Multimodal (text and image), 128K token context window, function calling support. Google claims it matches the performance of the previous generation’s 27B model.

Ministral 3 8B Reasoning (8B parameters, Mistral). Released December 2025. Mistral’s small reasoning model with dedicated chain-of-thought capabilities. Slightly larger than the others at 8B, but still runs comfortably on consumer GPUs.

DeepSeek-R1-Distill-Qwen-7B (7B parameters, DeepSeek). Released January 2025. DeepSeek’s R1 reasoning model distilled into a 7B Qwen2.5 base. One of the first small models to demonstrate that reasoning capabilities could be transferred from massive models to tiny ones.

The Baseline: What Frontier Models Scored

To understand why the small model results matter, you need to know what the big models achieved when they were the state of the art.

| Benchmark | GPT-4 (Mar 2023) | GPT-4o (2024) | Claude 3 Opus (Mar 2024) |
|---|---|---|---|
| MMLU | 86.4% | 88.7% | 86.8% |
| MATH | 42.2% | 76.6% | |
| HumanEval | 67.0% | 90.2% | 55.0% |
| GSM8K | ~92% | | |
| GPQA | | 53.6% | |
| AIME 2024 | | ~12% | |

These were the best numbers in the world at their time of release. GPT-4’s MMLU score of 86.4% was considered a breakthrough in March 2023. GPT-4o’s MATH score of 76.6% represented major progress in mid-2024.

Now look at what costs nothing and fits in 3GB of VRAM.

The Numbers: Small Models vs. Frontier

Math and Reasoning

This is where the gap has closed the most dramatically. The AIME (American Invitational Mathematics Examination) is a competition-level math test well above typical high school problems. MATH-500 tests a broad range of mathematical problem-solving.

| Model | Params | AIME 2024 | AIME 2025 | MATH / MATH-500 | GPQA Diamond |
|---|---|---|---|---|---|
| GPT-4o (2024) | ~1.8T* | ~12% | | 76.6% (MATH) | 53.6% |
| Qwen3-4B-Thinking | 4B | 73.8% | 81.3% | | 65.8% |
| Qwen3-4B-Instruct (non-thinking) | 4B | | 47.4% | 97.0% (MATH-500) | 62.0% |
| Ministral 3 8B Reasoning | 8B | 86.0% | 78.7% | 87.6% (Maj@1) | 66.8% |
| Phi-4-mini-reasoning | 3.8B | 33.6% | | 92.5% (MATH-500) | 45.1% |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 55.5% | | 92.8% (MATH-500) | 49.1% |

*GPT-4o’s parameter count is estimated; OpenAI has not disclosed it.

Read that table carefully. Qwen3-4B-Thinking scores 81.3% on AIME 2025. GPT-4o scored approximately 12% on AIME 2024 — a different year’s test, but of comparable difficulty. A model that is roughly 450 times smaller is solving competition math at roughly 6-7 times the rate.

Ministral 3 8B Reasoning scores 86.0% on AIME 2024 and 66.8% on GPQA Diamond, beating GPT-4o’s 53.6% on the latter. On the GPQA Diamond graduate-level science reasoning test, the 4B Qwen3-Thinking scores 65.8% — 12 points above GPT-4o’s result.

The MATH-500 results tell a similar story. Qwen3-4B-Instruct (non-thinking mode) hits 97.0%. Phi-4-mini-reasoning gets 92.5%. DeepSeek-R1-Distill-Qwen-7B manages 92.8%. GPT-4o scored 76.6% on the full MATH benchmark. MATH-500 is a 500-problem subset of MATH, so the scores are not directly comparable, but they draw from the same problem distribution, and the direction is clear.

General Knowledge (MMLU)

MMLU is the classic broad knowledge test — 57 subjects from elementary math to professional law. MMLU-Pro is a harder, more discriminating version.

| Model | Params | MMLU | MMLU-Redux | MMLU-Pro |
|---|---|---|---|---|
| GPT-4 (Mar 2023) | ~1.8T | 86.4% | | |
| GPT-4o (2024) | ~1.8T | 88.7% | | |
| GPT-4o-mini (2024) | | 77.2% | | 62.8% |
| Claude 3 Opus (2024) | | 86.8% | | |
| Qwen3-4B-Thinking | 4B | | 86.1% | 74.0% |
| Qwen3-4B-Instruct | 4B | | 84.2% | 69.6% |
| Ministral 3 8B Reasoning | 8B | 76.1% | 79.3% | |
| Phi-4-mini-instruct | 3.8B | 67.3% | | 52.8% |
| Llama 3.2 3B | 3B | 63.4% | | |
| Gemma 3 4B | 4.3B | | | 43.6% |

This is where the picture gets more nuanced. On standard MMLU, the small models do not match GPT-4’s 86.4%. Phi-4-mini hits 67.3%, Ministral 3 8B reaches 76.1%, and Llama 3.2 3B lands at 63.4%. That is a real gap.

But look at MMLU-Redux (a cleaner version of the test that fixes known issues with the original MMLU). Qwen3-4B-Thinking scores 86.1% — essentially tying GPT-4’s MMLU score. And on MMLU-Pro, the harder version, Qwen3-4B-Thinking hits 74.0%, which is 11 points above GPT-4o-mini’s 62.8%.

The pattern: small thinking models close the gap on knowledge tests. Non-thinking models and models without dedicated reasoning chains still lag behind.

Coding

| Model | Params | HumanEval | LiveCodeBench v6 | MultiPL-E |
|---|---|---|---|---|
| GPT-4 (Mar 2023) | ~1.8T | 67.0% | | |
| GPT-4o (2024) | ~1.8T | 90.2% | | |
| Phi-4-mini-instruct | 3.8B | 74.4% | | |
| Qwen3-4B-Thinking | 4B | | 55.2% | |
| Qwen3-4B-Instruct | 4B | | 35.1% | 76.8% |
| Ministral 3 8B Reasoning | 8B | | 61.6% | |
| DeepSeek-R1-Distill-Qwen-7B | 7B | | 37.6% | |

Phi-4-mini’s HumanEval score of 74.4% beats GPT-4’s original 67.0% from March 2023. That is a 3.8-billion-parameter model from Microsoft outscoring what was the most expensive model in the world two years earlier on code generation.

LiveCodeBench (a contamination-resistant coding benchmark that uses new problems) shows Ministral 3 8B Reasoning at 61.6% and Qwen3-4B-Thinking at 55.2%. These benchmarks are harder to compare directly against frontier models because LiveCodeBench did not exist when GPT-4 was released, but the scores put these small models in competitive territory with mid-2024 frontier systems.

Instruction Following (IFEval)

IFEval tests whether models can follow specific, verifiable instructions like “write more than 400 words” or “include the keyword ‘neural’ at least three times.”

| Model | Params | IFEval |
|---|---|---|
| Qwen3-4B-Thinking | 4B | 87.4% |
| Qwen3-4B-Instruct | 4B | 83.4% |
| Llama 3.2 3B | 3B | 77.4% |
| Phi-4-mini-instruct | 3.8B | |

Qwen3-4B-Thinking’s 87.4% on IFEval is a strong result. For context, many frontier models in early 2024 scored in the 80-85% range on this benchmark. A 4B model following instructions at this level means it can reliably handle structured prompts, system instructions, and output format requirements — exactly what you need for tool use and agent workflows.
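Constraints like these are attractive precisely because they are machine-checkable. A minimal sketch of two such verifiers (the function names and the sample response are illustrative, not taken from the IFEval codebase):

```python
import re

def check_min_words(text: str, n: int) -> bool:
    # "Write more than N words": count whitespace-separated tokens.
    return len(text.split()) > n

def check_keyword_count(text: str, keyword: str, n: int) -> bool:
    # "Include the keyword K at least N times", case-insensitive.
    return len(re.findall(re.escape(keyword), text, re.IGNORECASE)) >= n

response = "Neural networks keep improving; neural scaling laws and neural architectures drive it."
print(check_keyword_count(response, "neural", 3))  # True
print(check_min_words(response, 400))              # False
```

Because each constraint reduces to a deterministic check like this, IFEval scores measure compliance directly rather than relying on another model as judge.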

Function Calling and Tool Use (BFCL)

The Berkeley Function Calling Leaderboard (BFCL) tests a model’s ability to select the right function, generate correct arguments, and handle multi-step tool use chains. This is the benchmark that matters most for agent applications.

| Model | Params | BFCL-v3 |
|---|---|---|
| Qwen3-4B-Thinking | 4B | 71.2% |
| Qwen3-4B-Instruct | 4B | 61.9% |

Qwen3-4B-Thinking’s 71.2% on BFCL-v3 stands out. The thinking mode variant consistently outperforms the instruct variant on agentic tasks, which makes sense — having a reasoning chain helps a model figure out which tool to use and what arguments to pass.

The TAU benchmark results tell a similar story. On the TAU retail tasks (agentic benchmarks that simulate customer-service scenarios requiring tool calls), Qwen3-4B-Thinking scores 66.1% on TAU1-Retail and 53.5% on TAU2-Retail. For a 4-billion-parameter model, these numbers indicate genuine tool-use capability, not just pattern matching.
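At its core, a BFCL-style check asks: did the model name a registered tool and supply exactly the arguments that tool requires? A minimal sketch of that validation-and-dispatch loop (the tool, registry, and call format here are illustrative stand-ins, not the BFCL harness):

```python
import json

# Hypothetical tool: a stub standing in for a real API call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Registry mapping tool names to (callable, required argument names).
TOOLS = {"get_weather": (get_weather, {"city"})}

def dispatch(tool_call: str) -> str:
    # Validate and run a model-emitted call of the shape
    # {"name": ..., "arguments": {...}} before touching the real function.
    call = json.loads(tool_call)
    fn, required = TOOLS[call["name"]]
    args = call["arguments"]
    if set(args) != required:
        raise ValueError(f"expected arguments {required}, got {set(args)}")
    return fn(**args)

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))  # Sunny in Paris
```

A model that scores 71.2% on BFCL-v3 is one whose emitted JSON survives this kind of strict validation on roughly seven calls out of ten, across single, parallel, and multi-step cases.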

Where They Shine

Three areas stand out where small models in 2025-2026 have caught up or gotten close enough to matter:

Mathematical reasoning. The thinking mode models (Qwen3-4B-Thinking, Ministral 3 8B Reasoning, Phi-4-mini-reasoning) perform well above their weight class. Chain-of-thought reasoning lets them “show their work” and arrive at correct answers on problems that stump much larger models running in single-pass mode. Qwen3-4B-Thinking’s 81.3% on AIME 2025 and Ministral 3 8B’s 86.0% on AIME 2024 are numbers that would have been competitive with frontier reasoning models from early 2025.

Structured output and function calling. The combination of improved instruction following (IFEval 87.4% for Qwen3-4B-Thinking) and dedicated function calling training (BFCL-v3 71.2%) means these models can reliably generate JSON, call APIs, and participate in multi-step workflows. This is the capability that unlocks practical agent applications.

Math problem-solving at scale. On the MATH-500 benchmark, three sub-10B models score above 92% (Qwen3-4B-Instruct at 97.0%, Phi-4-mini-reasoning at 92.5%, DeepSeek-R1-Distill-Qwen-7B at 92.8%). GPT-4o’s MATH score of 76.6% from mid-2024 is firmly behind. For applications that need reliable math computation — data analysis pipelines, financial calculations, scientific tooling — a local 4B model can now do the job.

Where They Still Fall Short

The benchmarks do not lie in either direction. Here is where small models remain clearly behind:

Broad general knowledge. Standard MMLU scores for non-thinking small models cluster between 63% and 76%. GPT-4-class models sit at 86-89%. A 4B model simply does not have the capacity to store as much factual knowledge as a trillion-parameter system. The thinking variants close this gap significantly (Qwen3-4B-Thinking hits 86.1% on MMLU-Redux), but only by spending inference time reasoning through answers rather than recalling them directly.

Long and complex generation. Small models produce good short-to-medium outputs but degrade on tasks requiring sustained coherence over thousands of tokens. Creative writing, long-form analysis, and multi-document synthesis remain areas where larger models have a clear edge. The Arena-Hard scores confirm this: Qwen3-4B-Thinking scores 34.9%, while GPT-4o-mini hits 53.7%.

Multilingual breadth. Phi-4-mini scores 49.3% on Multilingual MMLU versus GPT-4o-mini’s 72.9%. Qwen3-4B does better at 69.0% on MultiIF, partly because Alibaba trains heavily on Chinese and multilingual data, but the gap is still real for most non-English languages.

Raw coding ability on hard problems. LiveCodeBench scores for small models (35-62%) are well behind what top frontier models achieve. Complex multi-file refactoring, architecture decisions, and advanced algorithms still need bigger brains.

Nuance and ambiguity. GPQA Diamond scores (49-67% for small models vs. 53.6% for GPT-4o) show that graduate-level reasoning about ambiguous scientific questions remains hard at small scale, even though some small thinking models are now competitive.

What This Means for Running AI Locally

The practical question: can you actually use these models on your own hardware?

Speed on Consumer Hardware

Benchmark data from community testing gives a picture of real-world inference speed:

| Model | Hardware | Approximate Speed |
|---|---|---|
| Qwen3 4B (Q4) | RTX 5090 | ~40 tok/s |
| Qwen3 4B (Q4) | M4 Max | ~25 tok/s |
| Qwen3 4B (FP16) | RTX 5090 | ~45 tok/s |
| Gemma 3 4B (Q4) | RTX 5090 | ~50 tok/s |
| Gemma 3 4B (FP16) | RTX 5090 | ~60 tok/s |
| 4B models (Q4) | RTX 3060 12GB | 40-60 tok/s |
| 4B models (Q4) | RTX 4060 8GB | 40-50 tok/s |
| 7-8B models (Q4) | RTX 3060 12GB | 20-40 tok/s |

Anything above 20 tokens per second feels responsive for interactive use. Above 50 feels nearly instant. A 4B model quantized to Q4_K_M fits in roughly 2.5-3GB of VRAM and runs at comfortable interactive speeds on nearly any modern GPU — including laptop GPUs and Apple Silicon Macs with 8GB of unified memory.

The thinking mode models are slower in wall-clock time because they generate a reasoning chain before the final answer. A question that takes a non-thinking Qwen3-4B model 2 seconds might take the thinking variant 10-15 seconds. The tradeoff is much better accuracy on hard problems.
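The arithmetic behind that tradeoff is simple: wall-clock latency is roughly total generated tokens divided by decode speed, and the reasoning chain counts toward the total. A quick sketch (the token counts are illustrative assumptions, not measurements):

```python
def wall_clock_seconds(answer_tokens: int, reasoning_tokens: int, tok_per_s: float) -> float:
    # Decode-bound latency: every generated token costs the same,
    # whether it is part of the hidden reasoning chain or the final answer.
    return (answer_tokens + reasoning_tokens) / tok_per_s

# Non-thinking: an ~80-token answer at 40 tok/s
print(wall_clock_seconds(80, 0, 40.0))    # 2.0 s
# Thinking mode: the same answer preceded by a ~400-token reasoning chain
print(wall_clock_seconds(80, 400, 40.0))  # 12.0 s
```

This is why thinking mode is worth toggling per task: on easy questions the extra tokens are pure cost, while on AIME-style problems they are what buys the accuracy.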

VRAM Requirements

| Model | FP16 VRAM | Q4_K_M VRAM |
|---|---|---|
| Qwen3 4B | ~8 GB | ~2.5 GB |
| Phi-4-mini (3.8B) | ~7.6 GB | ~2.4 GB |
| Gemma 3 4B | ~8.6 GB | ~2.8 GB |
| Ministral 3 8B | ~16 GB | ~5 GB |
| DeepSeek-R1-Distill-Qwen-7B | ~14 GB | ~4.5 GB |

The 4B models in Q4 quantization are the sweet spot. They fit on an 8GB GPU with room to spare for context, they run fast, and they now bring genuine reasoning capability. An RTX 3060 12GB can run any of these models with headroom for 8K+ token contexts.
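Those figures follow from a back-of-envelope rule: weight memory is parameter count times bits per weight, before KV cache and runtime overhead. A sketch, assuming Q4_K_M averages roughly 4.85 bits per weight (a figure commonly cited for llama.cpp quantization, used here as an approximation):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    # Weight storage only, in GB (10^9 bytes): parameters x bits / 8.
    # Excludes KV cache, activations, and framework overhead.
    return params_billions * bits_per_weight / 8

print(weight_memory_gb(4, 16))              # FP16: 8.0 GB
print(round(weight_memory_gb(4, 4.85), 1))  # Q4_K_M: ~2.4 GB
```

The same formula puts an 8B model at Q4 near 4.9 GB of weights, which is why 8GB cards handle 4B models with context to spare but get tight with 7-8B models.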

The Bottom Line

The claim that a 4B parameter model can match GPT-4 needs a big asterisk: on specific benchmarks, in specific domains, with thinking mode enabled.

Qwen3-4B-Thinking does not replace GPT-4o for all tasks. It cannot match its breadth of knowledge, its multilingual fluency, or its ability to write 5,000 words of coherent prose. It cannot replace a frontier model for complex multi-step planning or nuanced open-ended conversation.

But on math? On structured tool use? On following instructions and generating valid JSON? On solving specific coding problems? The 4B model is not just competitive — it beats what the most expensive AI on Earth could do 18 months ago.

That is the real story. Not that small models are “as good as” frontier models — they are not. The story is that the floor has risen so fast that a model you can run on a phone in airplane mode now handles tasks that required a $200/month API subscription and a datacenter in 2024.

For developers building agents, pipelines, and tools: you no longer need a frontier API for the reasoning-and-tool-use part of your stack. For privacy-conscious users: the “but local models are too dumb” argument is dead for a growing list of use cases. For anyone who thought the only path to better AI was bigger models: the small model results from 2025 suggest that training technique, data quality, and reasoning architecture matter at least as much as raw parameter count.

The 4B model revolution is not coming. It already happened. The benchmarks just took a while to catch up with the reality.