Open Source

Developer maximecb builds a Rust-native, CPU-only LFM2.5-8B-A1B implementation running at 37 tok/s

r/LocalLLaMA June 09, 2026

⚡ 8B-parameter model runs entirely on CPU, uses ~7GB of RAM, and decodes at ~37 tokens/s

Deep Dive

Developer maximecb has published a Rust-native implementation of the LFM2.5-8B-A1B language model as a Cargo crate; it is still a work in progress. On a Ryzen 7950X, decode speed is close to 37 tokens/s and memory usage is around 7GB, so the model fits comfortably on a machine with 16GB of RAM. The crate provides callbacks for tool use, lets several Agent instances reuse the same weights while each keeps its own KV cache, and supports cloning an Agent so that a prompt prefilled once does not have to be prefilled again. Prefill is not yet optimized and currently runs at roughly the same speed as decode; the developer is working on speeding it up.
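The weight-sharing and cloning design described above can be illustrated with a minimal Rust sketch. Everything in it is hypothetical: `Weights`, `KvCache`, `Agent`, `feed`, and `context_len` are illustrative names, not the crate's actual API, and the toy "prefill" only pushes placeholder values. It assumes the weights are shared through `Arc` (every agent points at one copy) while each agent owns its KV cache, and that cloning an agent copies its cache so the fork resumes from an already-prefilled prompt.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the model's read-only weights, loaded once.
struct Weights {
    // Real weights would be large tensors; a small Vec keeps the sketch tiny.
    data: Vec<f32>,
}

/// Hypothetical per-agent KV cache: one entry per token already processed.
#[derive(Clone, Default)]
struct KvCache {
    entries: Vec<(f32, f32)>,
}

#[derive(Clone)]
struct Agent {
    weights: Arc<Weights>, // shared: cloning the Arc copies a pointer, not the weights
    cache: KvCache,        // private: cloning copies the cache, keeping the prefilled prompt
}

impl Agent {
    fn new(weights: Arc<Weights>) -> Self {
        Agent { weights, cache: KvCache::default() }
    }

    /// Toy "prefill": push one placeholder key/value pair per token.
    /// Returns how many tokens actually had to be processed.
    fn feed(&mut self, tokens: &[u32]) -> usize {
        for &t in tokens {
            let w = self.weights.data[t as usize % self.weights.data.len()];
            self.cache.entries.push((w, w * 2.0));
        }
        tokens.len()
    }

    /// Number of tokens currently held in this agent's context.
    fn context_len(&self) -> usize {
        self.cache.entries.len()
    }
}

fn main() {
    let weights = Arc::new(Weights { data: vec![0.5; 16] });

    // Two independent agents over the same weights, each with its own cache.
    let mut a = Agent::new(Arc::clone(&weights));
    let b = Agent::new(Arc::clone(&weights));

    // Prefill a shared system prompt once, on `a` only.
    let system_prompt: Vec<u32> = (0..100).collect();
    a.feed(&system_prompt);

    // Fork: `c` starts from a's cache, so the 100 prompt tokens are not reprocessed.
    let mut c = a.clone();
    let new_work = c.feed(&[7, 8, 9]);

    println!("strong refs to weights: {}", Arc::strong_count(&weights));
    println!(
        "a={} b={} c={} (c processed {} new tokens)",
        a.context_len(),
        b.context_len(),
        c.context_len(),
        new_work
    );
}
```

Running it prints a strong count of 4 (the original handle plus three agents) and shows `c` reaching 103 tokens of context after processing only 3 new ones. The trade-off in this sketch is that each clone duplicates the KV cache in memory, which is usually far cheaper than redoing prefill, especially while prefill runs at roughly decode speed.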

Key Points

Decode speed of ~37 tokens/s on Ryzen 7950X, no GPU required
Memory usage is around 7GB, so it fits on a computer with 16GB of RAM
Supports weight sharing across agents and cloning to skip repeated prefill work
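The memory and speed figures above can be sanity-checked with back-of-envelope arithmetic: weight memory is parameters × bits per weight / 8, and a decoder limited by memory bandwidth can produce at most roughly bandwidth / (bytes of weights read per token) tokens per second. For a mixture-of-experts model, only the active parameters need to be read for each token. The inputs below are assumptions for illustration, not figures from the source: the source does not state the weight format, so 7 bits per weight and 60 GB/s of RAM bandwidth are hypothetical values, and the ~1B active-parameter count is read off the "A1B" suffix in the model name.

```rust
/// Bytes needed to hold `params` weights stored at `bits_per_weight` bits each.
fn weight_bytes(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0
}

/// Rough ceiling on decode speed if every token must stream the active
/// weights from RAM once: bandwidth / bytes read per token.
/// Ignores KV-cache reads and compute, so real speeds will be lower.
fn decode_ceiling_tok_s(bandwidth_bytes_s: f64, active_params: f64, bits_per_weight: f64) -> f64 {
    bandwidth_bytes_s / weight_bytes(active_params, bits_per_weight)
}

fn main() {
    const GB: f64 = 1e9;
    // Hypothetical inputs for illustration only; not stated in the source.
    let total_params = 8.0e9; // the "8B" in the model name
    let active_params = 1.0e9; // read off the "A1B" suffix; approximate
    let bits = 7.0; // assumed average bits per stored weight
    let bandwidth = 60.0 * GB; // assumed dual-channel DDR5 throughput

    println!("weights: {:.1} GB", weight_bytes(total_params, bits) / GB);
    println!(
        "decode ceiling: {:.0} tok/s",
        decode_ceiling_tok_s(bandwidth, active_params, bits)
    );
}
```

With these assumed inputs the sketch prints 7.0 GB of weights and a ceiling of about 69 tok/s. A measured ~37 tok/s sitting below such a ceiling is plausible once attention, KV-cache reads, and compute overheads are counted, and the same arithmetic shows why a dense 8B model, reading all of its weights per token, would have a much lower ceiling.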

Why It Matters

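Shows that an 8B-parameter model can run locally at interactive speed on a desktop CPU, with no GPU. The "A1B" in the model name indicates roughly 1B active parameters per token, which helps keep the per-token work small enough for CPU-only inference.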
