
Taalas: LLMs baked into hardware. No HBM; weights and model architecture etched in silicon -> 16,000 tokens/second

Their radical approach etches models onto custom ASICs in just 60 days, achieving <1ms latency.

Deep Dive

Taalas has developed a radical hardware approach in which an entire LLM, weights and architecture included, is etched directly onto a custom silicon chip (ASIC). Their demonstrator runs Llama 3.1 8B at >16,000 tokens/second with <1 ms latency, and the company claims the design is 10x more power-efficient and 20x cheaper to produce. Because the model lives in silicon, there is no need for expensive HBM memory or complex cooling, and Taalas says it can turn a previously unseen software model into a finished chip in 60 days.
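As a quick sanity check on the figures above (a back-of-envelope sketch, not something from Taalas), the throughput claim implies an average per-token time far below the stated <1 ms latency:

```python
# Back-of-envelope check: at 16,000 tokens/second, the implied
# average time per generated token is 1/16,000 of a second.
TOKENS_PER_SECOND = 16_000

per_token_ms = 1_000 / TOKENS_PER_SECOND  # milliseconds per token
print(f"{per_token_ms} ms per token")     # prints "0.0625 ms per token"
```

That works out to 62.5 µs per token, roughly 16x under the 1 ms latency figure, so the two claims are at least mutually consistent.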

Why It Matters

This could enable ultra-low-latency, cost-effective AI for real-time applications like speech synthesis, avatars, and computer vision where speed is critical.