What a time to be alive: from 1 tk/sec to 20-100 tk/sec for huge models
Massive advancements let users run powerful AI models at unprecedented speeds.
Deep Dive
Hardware that previously ran Llama405b q4 at 1.2 tk/sec now runs state-of-the-art models, including kimik2.6, deepseekv4flash, minimax2.7, step3.5flash, and qwen3.5-397b, at speeds between 30 and 100 tk/sec while, in the author's words, "crushing" Llama405b. For a few hundred dollars, users can run qwen3.6-36b at 50 tk/sec at home. The author notes that early experiments with running large models at slow speeds were dismissed as "stupid" or a "waste of time," but those efforts are now paying off.
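To make the speedup concrete, here is a back-of-envelope sketch (the 1,000-token response length is an assumption, not from the article) showing how long a reply takes at the throughputs quoted above:

```python
# Hypothetical example: time to generate a 1,000-token response
# at the token throughputs quoted in the article.
def generation_time(tokens: int, tokens_per_sec: float) -> float:
    """Seconds needed to generate `tokens` at a given throughput."""
    return tokens / tokens_per_sec

for speed in (1.2, 30.0, 100.0):
    secs = generation_time(1000, speed)
    print(f"{speed:>5} tk/sec -> {secs:7.1f} s ({secs / 60:.1f} min)")
# 1.2 tk/sec takes ~14 minutes; 100 tk/sec takes 10 seconds.
```

At 1.2 tk/sec a long answer is a coffee break; at 30-100 tk/sec it is interactive, which is the practical difference the author is celebrating.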
Key Points
- Hardware that previously managed only 1.2 tk/sec with Llama405b now runs models at 30-100 tk/sec.
- Models like qwen3.6-36b run at 50 tk/sec on home hardware costing only a few hundred dollars.
- Advancements make sophisticated AI accessible for local experimentation and development.
Why It Matters
Unlocks powerful AI capabilities for developers, enhancing innovation and experimentation.