What if instead of loading an AI model into memory, you burned it directly into the chip itself?
That’s exactly what Taalas, a Toronto-based startup, has done with its HC1 chip. The result: 17,000 tokens per second from Llama 3.1 8B. For context, Nvidia’s B200 manages about 594 tokens per second on the same model. Cerebras hits around 2,000. Groq reaches 600.
The HC1 is not a GPU. It’s not even a traditional AI accelerator. The company has taken Llama 3.1 8B’s weights and etched them directly into the transistors using a mask-ROM recall fabric. The model isn’t loaded into the chip. The model is the chip.
How It Works
Traditional AI inference has a bottleneck: moving data between memory and compute. Even the fastest GPUs spend most of their time waiting for weights to arrive from HBM. Taalas eliminates this entirely.
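To see why this bottleneck dominates, consider a back-of-envelope calculation: in single-stream decoding, every generated token must stream the full weight set from memory once, so throughput is capped by memory bandwidth. The sketch below uses rough public figures (8B parameters at 2 bytes each, ~8 TB/s of HBM bandwidth for a B200-class part); these are illustrative assumptions, not measured numbers.

```python
# Back-of-envelope: single-stream LLM decoding is memory-bandwidth bound.
# Each token requires reading every weight from HBM once (batch size 1).
# All figures are rough public specs used for illustration only.

params = 8e9           # Llama 3.1 8B parameter count
bytes_per_param = 2    # FP16/BF16 weights
hbm_bandwidth = 8e12   # ~8 TB/s, roughly a B200-class part (assumed)

bytes_per_token = params * bytes_per_param            # ~16 GB streamed per token
max_tokens_per_sec = hbm_bandwidth / bytes_per_token  # bandwidth-imposed ceiling

print(f"~{max_tokens_per_sec:.0f} tokens/s single-stream upper bound")
```

The result lands around 500 tokens per second, close to the B200 figure quoted above, which is why storing weights in the compute fabric itself removes the ceiling rather than just raising it.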
“We can store four bits away and do the multiply related to it with a single transistor,” explains CEO Ljubisa Bajic, who previously architected AMD’s hybrid CPU-GPU designs and founded Tenstorrent.
The HC1 uses TSMC’s 6nm process, packs 53 billion transistors onto an 815 mm² die, and draws just 200 watts per card. A full 10-card server uses 2,500 watts total - a fraction of what a comparable GPU rack consumes.
The company claims the HC1 is 10x faster than Cerebras, 20x cheaper to manufacture, and uses 10x less power.
The Trade-Off
There’s an obvious catch: you can’t change the model. The Llama 3.1 8B weights are permanently part of the silicon. Want to run a different model? You need a different chip.
Taalas argues this matters less than you’d think. They’ve built in flexibility through configurable context windows and support for LoRA (Low-Rank Adaptation) - the fine-tuning technique that adds small trainable low-rank matrices on top of frozen base weights. You can customize the model’s behavior without changing its core.
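The LoRA mechanism is what makes frozen weights tolerable: the base weight matrix W stays fixed while a small trainable update B·A rides alongside it. A minimal NumPy sketch of the idea (dimensions and values are illustrative, not Taalas’s actual configuration):

```python
import numpy as np

# Minimal LoRA sketch: output = W @ x + B @ (A @ x), with W frozen.
rng = np.random.default_rng(0)
d, r = 512, 8                            # hidden size and adapter rank (illustrative)

W = rng.standard_normal((d, d))          # frozen base weights (on-chip, immutable)
A = rng.standard_normal((r, d)) * 0.01   # small trainable down-projection
B = np.zeros((d, r))                     # up-projection, zero-initialized

x = rng.standard_normal(d)
y = W @ x + B @ (A @ x)                  # base path plus low-rank adapter path

# With B initialized to zero, the adapter contributes nothing yet,
# so the adapted model starts out identical to the frozen base model.
assert np.allclose(y, W @ x)
```

Training updates only A and B (roughly 2·d·r values instead of d² here), which is why adapters can live off-chip while the base weights stay etched in silicon.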
And when a new model version does come out? Changing two metal layers on the chip costs dramatically less than training the model in the first place, according to Bajic. The company claims a 100x cost differential between training and customizing their HC chips.
Their partnership with TSMC enables a two-month turnaround from receiving model weights to shipping production cards.
The Reality Check
The 17,000 tokens-per-second figure deserves scrutiny. It’s measured on Llama 3.1 8B under optimal conditions. A well-configured H200 running a 12-billion-parameter model in FP8 precision can already approach 12,000 tokens per second.
The gap between “28x faster than B200” and “modestly faster than an optimized GPU on a comparable task” depends heavily on what you measure and how. Comparing a hardcoded 8B model to GPUs running larger, more capable models isn’t entirely fair.
Still, for high-volume inference at a fixed model - think API providers running millions of queries against the same base model - the economics could be compelling.
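The economics argument can be made concrete with a perf-per-watt comparison using the throughput figures quoted above. The HC1’s 200 W per card is stated by the company; the ~1000 W for a B200 is an assumed TDP ballpark, so treat the ratio as order-of-magnitude only:

```python
# Rough perf-per-watt comparison from the throughput numbers cited earlier.
# B200 power draw is an assumed TDP ballpark, not an official benchmark figure.

chips = {
    "HC1":  {"tokens_per_sec": 17_000, "watts": 200},    # per-card, as stated
    "B200": {"tokens_per_sec": 594,    "watts": 1000},   # assumed ~1 kW TDP
}

for name, c in chips.items():
    efficiency = c["tokens_per_sec"] / c["watts"]
    print(f"{name}: {efficiency:.1f} tokens/s per watt")
```

On these (favorable, single-model) numbers the gap is two orders of magnitude, which is the kind of margin that makes fixed-function silicon worth the inflexibility for a provider serving one model at scale.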
The Team and Funding
Taalas emerged from stealth with over $200 million raised across three rounds. The founding team comes from AMD, Tenstorrent, Apple, Google, and Nvidia. CEO Ljubisa Bajic and CTO Drago Ignjatovic both spent years designing GPUs and APUs at AMD.
The company has spent $30 million on R&D with $170+ million still in reserve. They have 25 employees and recently opened an online demo chatbot where anyone can test the hardware’s speed.
What’s Next
The HC1 with Llama 3.1 8B is shipping now. A mid-sized reasoning model variant launches in Q2. The HC2 platform, targeting 20 billion parameters with higher density and faster execution, is expected late 2026.
Frontier-class LLM support - possibly a full Llama or DeepSeek model - is planned for year-end.
What This Means
The HC1 represents an extreme position in the flexibility vs. performance trade-off. GPUs can run any model but leave performance on the table. Taalas chips achieve maximum performance but lock you into a single model.
For consumer local AI, this approach probably doesn’t make sense. Models evolve too quickly, and flexibility matters.
But for inference providers running the same model at massive scale, the math changes. If you’re serving billions of Llama queries per day and you know you’ll be on that model for at least six months, dedicated silicon starts to look attractive.
The broader trend is clear: as LLMs mature and settle into production, the hardware industry is fragmenting. The “GPU for everything” era may be ending. Specialized inference accelerators from Groq, Cerebras, SambaNova, and now Taalas are all betting that inference at scale demands purpose-built silicon.
Taalas just took that logic to its extreme conclusion.