A 9-billion parameter model that beats one 13 times its size. Native vision capabilities that work offline. Memory requirements low enough for a standard laptop. Alibaba’s Qwen 3.5 Small series, released March 2, changes what’s possible with local AI.
The timing is notable. While Apple’s M5 Max makes headlines for running 70B models, and enterprises scramble to secure cloud AI pipelines after the OpenClaw supply chain disaster, Alibaba quietly shipped something potentially more significant: genuinely capable AI that runs on hardware most people already own.
The Lineup
Qwen 3.5 Small spans four model sizes, each targeting different hardware constraints:
Qwen3.5-0.8B - The ultralight option. Runs on phones with 2-3GB of memory. Useful for basic chat and simple vision tasks.
Qwen3.5-2B - Mobile-ready at 4-5GB. Compatible with iPhone 15 Pro and similar devices. Handles more complex reasoning while staying responsive.
Qwen3.5-4B - The multimodal sweet spot at 6-7GB. Delivers performance close to the previous Qwen3-80B-A3B despite being 20 times smaller. This is where native vision capabilities start to shine.
Qwen3.5-9B - The flagship small model. At 10-16GB total memory (RAM plus VRAM), it runs on standard 16GB laptops without a dedicated GPU. One tester reported 30 tokens per second on an AMD Ryzen AI Max+ 395.
All models share the same architecture as the full 397B Qwen 3.5, just scaled down. They’re not distillations or approximations. The same Gated DeltaNet hybrid design, the same 201-language support, the same multimodal fusion approach.
Benchmarks That Matter
The 9B model’s numbers are striking. On MMLU-Pro, it scores 82.5 compared to 80.8 for OpenAI’s gpt-oss-120B, a model with 13 times more parameters. On GPQA Diamond, a graduate-level science benchmark: 81.7 versus 80.1.
Vision performance is equally impressive. On MMMU-Pro, a visual reasoning benchmark, Qwen3.5-9B scores 70.1, which is 22.5% higher than GPT-5-Nano’s 57.2. Video understanding hits 84.5 on VideoMME with subtitles.
The context window extends to 262,144 tokens natively, with YaRN scaling pushing that to over 1 million tokens for those willing to trade speed for length.
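The arithmetic behind that extension is simple: YaRN stretches RoPE positions by the ratio of target to native context, which here is exactly 4x. A minimal sketch of the core idea, using uniform position interpolation (actual YaRN applies this per frequency band with a smooth ramp, so this is an illustration, not the real algorithm):

```python
NATIVE_CTX = 262_144
TARGET_CTX = 1_048_576  # the "over 1 million" figure is 4x the native window

scale = TARGET_CTX / NATIVE_CTX  # 4.0

# Simplified position interpolation: out-of-range positions are squeezed
# back into the window the model was trained on.
def interpolated_position(pos: int, factor: float = scale) -> float:
    return pos / factor

# Every position in the extended window maps inside the trained range:
assert interpolated_position(TARGET_CTX - 1) < NATIVE_CTX
print(f"scaling factor: {scale:g}")
```

The "trade speed for length" caveat comes from attention cost, not the interpolation itself: a 4x longer sequence still has to be attended over.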
For agentic tasks, the model scores 66.1 on BFCL-V4 and 79.1 on TAU2-Bench. Not state-of-the-art, but respectable for something running on consumer hardware.
What “Native Multimodal” Actually Means
Previous approaches to adding vision to language models typically bolted an image encoder onto a pretrained text model. The result often felt like two systems awkwardly communicating rather than one unified intelligence.
Qwen 3.5’s architecture is different. According to Alibaba’s technical documentation, the vision encoder uses Conv3d patch embeddings to capture temporal dynamics in video, merging features from multiple layers rather than just the final layer. This early fusion approach means text, images, and video occupy the same representational space from the start of training.
The practical result: the 0.8B model can handle video understanding. Not well, but it works. The 4B and 9B variants handle complex visual reasoning tasks that would have required dedicated vision models a year ago.
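The token bookkeeping explains why even the 0.8B model can attempt video. A Conv3d patch embed groups frames temporally as well as spatially, so a clip collapses into far fewer tokens than per-frame encoding would produce. A sketch with assumed sizes (the 2-frame temporal patch, 14x14 spatial patch, and 2x2 merge below are common choices in open vision encoders, not confirmed specs for Qwen 3.5):

```python
def vision_token_count(frames: int, height: int, width: int,
                       t_patch: int = 2, s_patch: int = 14,
                       merge: int = 2) -> int:
    """Tokens produced by a Conv3d patch embed followed by spatial merging."""
    t = frames // t_patch        # temporal patches
    h = height // s_patch        # spatial patches per axis
    w = width // s_patch
    return t * (h // merge) * (w // merge)

# A 16-frame clip at 448x448 resolution:
print(vision_token_count(16, 448, 448))  # → 2048
```

Two thousand tokens for sixteen frames is well within even a small model's comfortable context, which is the practical payoff of fusing time into the patch embedding.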
Running It Locally
Getting started takes minutes. The simplest path is Ollama:
ollama pull qwen3.5:9b
ollama run qwen3.5:9b
The 9B variant downloads as a 6.6GB Q4_K_M quantized model. For most tasks, this quantization introduces negligible quality loss while dramatically reducing memory requirements.
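Q4_K_M keeps embeddings and a few sensitive layers at higher precision, so the file weighs more than a naive four-bits-per-weight count would suggest. A quick back-of-the-envelope check, using only the figures quoted above:

```python
def effective_bits_per_weight(file_gb: float, params_b: float) -> float:
    """Average bits per weight implied by a GGUF's on-disk size."""
    return file_gb * 1e9 * 8 / (params_b * 1e9)

# The 6.6 GB download of the 9B model works out to:
bpw = effective_bits_per_weight(6.6, 9.0)
print(f"~{bpw:.1f} bits per weight")  # → ~5.9 bits per weight
```

Just under six effective bits per weight is plausible for a mixed-precision 4-bit quant, and it explains why the download is 6.6GB rather than the ~4.5GB a pure 4-bit count would give.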
For multimodal work with images and video, llama.cpp offers more flexibility:
# Download the GGUF model from Hugging Face
# Launch with appropriate context window
./llama.cpp/llama-cli -m qwen3.5-9b-q4_k_m.gguf \
-c 8192 \
--chat-template qwen3.5
LM Studio provides a GUI option for those who prefer clicking over typing.
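Whichever frontend you pick, Ollama also listens on a local HTTP API (port 11434 by default), so the model can be scripted rather than used interactively. A minimal standard-library sketch, assuming the `ollama run` step above has the server up; the `ask` helper is illustrative naming, not part of Ollama:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_payload(prompt: str, model: str = "qwen3.5:9b") -> bytes:
    """JSON body for Ollama's /api/chat endpoint; stream=False returns
    a single complete JSON object instead of a token stream."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()

def ask(prompt: str) -> str:
    req = request.Request(OLLAMA_URL, data=build_payload(prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Because everything stays on localhost, this is the same zero-egress setup the privacy section below argues for: no key, no account, no telemetry.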
Thinking Mode
Like other recent reasoning models, Qwen 3.5 supports a “thinking” mode where it reasons through problems step by step. The recommended parameters for general tasks:
- Temperature: 1.0
- Top-p: 0.95
- Top-k: 20
- Presence penalty: 1.5
For coding tasks, drop the temperature to 0.6 and remove the presence penalty.
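Translated into request parameters, the two presets look like this. The helper below is a convenience sketch, not an official API; the parameter names follow the common OpenAI-style conventions local servers accept:

```python
def sampling_params(task: str = "general") -> dict:
    """Recommended Qwen 3.5 thinking-mode sampling settings by task type."""
    params = {
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 20,
        "presence_penalty": 1.5,
    }
    if task == "coding":
        params["temperature"] = 0.6       # lower temperature for determinism
        del params["presence_penalty"]    # no repetition penalty for code
    return params

print(sampling_params("coding"))
```

The intuition: code benefits from lower-variance sampling, and a presence penalty that discourages repetition is actively harmful when the correct output repeats identifiers.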
Thinking mode can be disabled for faster responses when you don’t need extended reasoning:
# Assumes an OpenAI-compatible local server (such as vLLM) serving the model;
# chat_template_kwargs is the vLLM convention for passing template options.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
The Privacy Advantage
Running AI locally isn’t just about avoiding subscription fees. Every query sent to cloud AI services becomes training data, usage metrics, and a permanent record tied to your account. The recent Firebase breaches exposed 300 million messages from AI chat apps. The OpenClaw supply chain attack compromised one in five packages in an enterprise agent ecosystem.
Local models eliminate these risks entirely. Your prompts never leave your machine. There’s no API key to leak, no Firebase database to misconfigure, no third-party dependency to compromise.
For sensitive work, such as legal documents, medical information, proprietary code, or personal journaling, this isn’t a nice-to-have. It’s increasingly the only responsible choice.
Limitations
The 9B model isn’t GPT-5 or Claude Opus 4.6. Complex multi-step reasoning, subtle instruction following, and creative writing still favor larger cloud models. The vision capabilities, while impressive for the size, won’t replace dedicated image analysis systems for professional use.
Speed depends heavily on hardware. CPU-only inference works but crawls. Even modest dedicated GPU acceleration or Apple Silicon neural engines make a significant difference.
And despite “201 language support,” English and Chinese get the most training data. Performance in other languages varies.
The Bigger Picture
A year ago, running capable multimodal AI locally meant either expensive hardware or significant quality compromises. Qwen 3.5 Small compresses that gap dramatically.
The 9B model, running on a standard laptop, outperforms cloud models with 13 times more parameters on multiple benchmarks. The 4B variant delivers near-80B performance in a 7GB package. The entire family ships under Apache 2.0, meaning you can use, modify, and deploy them commercially without restrictions.
For privacy-conscious users, developers building offline-capable applications, or anyone tired of paying per-token for cloud AI, these models represent a genuine inflection point.
The Bottom Line
Qwen 3.5 Small makes capable, private, multimodal AI accessible on hardware most people already own. The 9B model is the sweet spot for laptops with 16GB of memory.