What can you do if your hardware can generate 15,000 tokens/s?
New 6nm chip runs LLMs 100x faster than GPUs by permanently etching models into silicon.
A startup called Taalas is pioneering a radical approach to AI inference that could redefine hardware efficiency. Instead of using flexible but memory-bound GPUs, the company creates custom silicon where a specific large language model (LLM) is permanently etched, or 'burned,' into the chip's physical circuitry. This hardware-software co-design, demonstrated on their 6nm process chip, allows the model to run with extreme efficiency, reportedly achieving a staggering 15,000 tokens per second. The architecture eliminates the massive, costly memory subsystems (like HBM) required by GPUs, dramatically reducing production costs and power consumption.
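To see why the article calls GPU inference memory-bound, here is a rough back-of-envelope sketch. The model size, quantization, and bandwidth figures are illustrative assumptions (a 70B-parameter model at 8 bits on an H100-class GPU), not Taalas or NVIDIA specifications, and the result is a single-stream ceiling rather than a batched-throughput figure.

```python
# Back-of-envelope: single-stream decode speed when every generated token
# must stream all model weights from GPU memory (memory-bandwidth bound).
# All figures below are illustrative assumptions, not vendor or Taalas specs.

PARAMS = 70e9            # assumed 70B-parameter model (Llama-3-70B-class)
BYTES_PER_PARAM = 1.0    # assumed 8-bit quantized weights
HBM_BANDWIDTH = 3.35e12  # assumed ~3.35 TB/s, roughly an H100-class GPU

weight_bytes = PARAMS * BYTES_PER_PARAM          # ~70 GB read per generated token
gpu_tokens_per_s = HBM_BANDWIDTH / weight_bytes  # bandwidth-bound ceiling, one stream

print(f"GPU single-stream ceiling: ~{gpu_tokens_per_s:.0f} tokens/s")
print(f"Claimed hardwired-silicon rate: 15,000 tokens/s "
      f"(~{15_000 / gpu_tokens_per_s:.0f}x that single-stream ceiling)")
```

GPUs recover aggregate throughput by batching many requests, which is why the per-stream gap above is larger than the headline "100x" figure; the point of etching the weights into the circuitry is that no such weight traffic exists at all, so a single stream is no longer capped by memory bandwidth.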
However, this breakthrough comes with a fundamental trade-off: a complete lack of model flexibility. Each Taalas chip is a single-purpose device, hardwired to run only the specific model version it was fabricated for, such as Llama 3 70B or Qwen 2.5 32B. You cannot swap or update the model without manufacturing a new chip. This makes the technology ideal for high-volume, fixed-function deployments but unsuitable for general-purpose AI development. The company is showcasing this capability through a demo platform called chatjimmy.ai.
The potential applications for such speed are transformative, particularly for real-time, interactive media. As highlighted in community discussions on r/LocalLLM, this could enable live-streaming AI-generated movies or video games with unprecedented dynamism. Imagine a massively multiplayer online (MMO) game where every non-player character (NPC) is powered by a unique, high-quality LLM, generating fully personalized dialogue instantly for thousands of concurrent players without any perceptible delay. This moves AI from a tool for batch processing to a core engine for real-time, immersive experiences.
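To make the "thousands of concurrent players" idea concrete, here is a rough capacity sketch. It assumes the 15,000 tokens/s figure can be treated as aggregate throughput, and the reply length and interaction rate are illustrative guesses, not numbers from the article.

```python
# Rough capacity sketch for the MMO-NPC scenario described above.
# All numbers are illustrative assumptions, not measured figures.

AGGREGATE_TOKENS_PER_S = 15_000   # throughput claimed for the hardwired chip
TOKENS_PER_NPC_REPLY = 40         # assumed length of one short dialogue line
REPLIES_PER_PLAYER_PER_MIN = 2    # assumed NPC interactions per player per minute

replies_per_s = AGGREGATE_TOKENS_PER_S / TOKENS_PER_NPC_REPLY
players_supported = replies_per_s / (REPLIES_PER_PLAYER_PER_MIN / 60)

print(f"~{replies_per_s:.0f} NPC replies per second")
print(f"~{players_supported:,.0f} concurrent players per chip under these assumptions")
```

Under those assumptions a single chip handles roughly 375 NPC replies per second, or on the order of ten thousand lightly interacting players, which is the kind of headroom the real-time media scenarios above depend on.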
- Achieves 15,000 tokens/sec by hardwiring a specific LLM directly into 6nm chip silicon.
- Eliminates need for expensive GPU memory (HBM), drastically cutting production cost and power.
- Major trade-off: zero model flexibility—each chip runs only one permanently etched model.
Why It Matters
Enables real-time, immersive AI applications like interactive NPCs in games and live AI-generated media at scale.