What a time to be alive: from 1 tk/sec to 20-100 tk/sec for huge models
Massive advancements let users run powerful AI models at unprecedented speeds.
Deep Dive
Hardware that previously ran Llama405b q4 at 1.2 tk/sec now runs state-of-the-art models, including kimik2.6, deepseekv4flash, minimax2.7, step3.5flash, and qwen3.5-397b, at speeds between 30 and 100 tk/sec while, in the author's words, "crushing" Llama405b. For a few hundred dollars, users can run qwen3.6-36b at 50 tk/sec at home. The author notes that early experiments with running large models at slow speeds were dismissed as "stupid" or a "waste of time," but those efforts are now paying off.
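To make the speedup concrete, here is a back-of-envelope sketch (the 1,000-token response length is an assumption, not from the article) showing how long a reply takes at the throughputs quoted above:

```python
# Hypothetical example: time to generate a 1,000-token response
# at the token throughputs quoted in the article.
def generation_time(tokens: int, tokens_per_sec: float) -> float:
    """Seconds needed to generate `tokens` at a given throughput."""
    return tokens / tokens_per_sec

for speed in (1.2, 30.0, 100.0):
    secs = generation_time(1000, speed)
    print(f"{speed:>5} tk/sec -> {secs:7.1f} s ({secs / 60:.1f} min)")
# 1.2 tk/sec takes ~14 minutes; 100 tk/sec takes 10 seconds.
```

At 1.2 tk/sec a long answer is a coffee break; at 30-100 tk/sec it is interactive, which is the practical difference the author is celebrating.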
Key Points
- Hardware that previously managed only 1.2 tk/sec with Llama405b now runs models at 30-100 tk/sec.
- Models like qwen3.6-36b run at 50 tk/sec on home hardware costing only a few hundred dollars.
- Advancements make sophisticated AI accessible for local experimentation and development.
Why It Matters
Unlocks powerful AI capabilities for developers, enhancing innovation and experimentation.