Open Source

Free ASIC Llama 3.1 8B inference at 16,000 tok/s - no, not a joke

A startup's custom hardware achieves unprecedented speed, offering free API access as a proof-of-concept.

Deep Dive

Hardware startup Taalas has launched a free chatbot and API endpoint powered by its custom ASIC chip. The system runs Meta's Llama 3.1 8B model at a staggering 16,000 tokens per second. While an 8B model has limited capabilities compared to larger frontier models, Taalas is offering it as a public demo to showcase how specialized inference hardware can achieve extreme speed, which could benefit applications that require ultra-fast, basic AI responses.
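To put the 16,000 tokens/second figure in perspective, here is a quick back-of-the-envelope calculation. The throughput number comes from the article; the response lengths are illustrative assumptions, not Taalas benchmarks.

```python
# What a 16,000 tok/s decode rate means in wall-clock time.
# Throughput is from the article; token counts below are illustrative.

THROUGHPUT_TOK_PER_S = 16_000

def generation_time_ms(num_tokens: int, tok_per_s: float = THROUGHPUT_TOK_PER_S) -> float:
    """Time to generate num_tokens at the given decode throughput, in milliseconds."""
    return num_tokens / tok_per_s * 1000

per_token_ms = generation_time_ms(1)      # ~0.0625 ms per token
short_reply_ms = generation_time_ms(100)  # ~6.25 ms for a 100-token answer
long_reply_ms = generation_time_ms(2000)  # ~125 ms for a 2,000-token answer

print(f"{per_token_ms:.4f} ms/token | "
      f"100-token reply: {short_reply_ms:.2f} ms | "
      f"2,000-token reply: {long_reply_ms:.1f} ms")
```

At those latencies, even multi-thousand-token responses complete faster than a typical network round trip, which is what makes the speed interesting for real-time use cases.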

Why It Matters

It demonstrates a path to making AI inference radically faster and cheaper, potentially enabling new real-time applications.