Free ASIC Llama 3.1 8B inference at 16,000 tok/s - no, not a joke
A startup's custom silicon achieves unprecedented inference speed, with free API access offered as a proof of concept.
Deep Dive
Hardware startup Taalas has launched a free chatbot and API endpoint powered by its custom inference ASIC. The system runs Meta's Llama 3.1 8B model at a staggering 16,000 tokens per second. The small model's capabilities are limited, but Taalas is offering the public demo to showcase how specialized inference hardware can reach extreme speeds, which could benefit applications that need ultra-fast, basic AI responses.
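To put the headline number in perspective, here is a back-of-envelope sketch of what 16,000 tokens per second means for response latency. Only the throughput figure comes from the article; the token counts below are illustrative assumptions.

```python
# Back-of-envelope latency at the quoted throughput.
# Only the 16,000 tok/s figure comes from the article;
# the example token counts are illustrative assumptions.

TOKENS_PER_SECOND = 16_000  # quoted rate for Llama 3.1 8B on the ASIC

def generation_time_ms(num_tokens: int, tok_per_s: int = TOKENS_PER_SECOND) -> float:
    """Milliseconds to generate num_tokens at a steady decode rate."""
    return num_tokens / tok_per_s * 1000

# A full 1,000-token answer would stream out in ~62.5 ms,
# well under typical human reaction time (~200 ms).
print(f"{generation_time_ms(1000):.1f} ms")  # 62.5 ms
```

At this rate, even multi-turn pipelines that chain several model calls could stay comfortably inside an interactive latency budget, which is the kind of real-time use case the demo is meant to highlight.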
Why It Matters
It demonstrates a path to making AI inference radically faster and cheaper, potentially enabling new real-time applications.