SymbolicLight V1 spike-aware runtime runs 2x faster on CPUs

arXiv cs.NE June 03, 2026

⚡22.63 tokens/s on a single thread beats Qwen2.5-1.5B by 2.3x.

Deep Dive

Researchers from the SymbolicLight project have published a systems-oriented paper on spike-aware inference for sparse spiking language models. The work, led by Ting Liu, introduces a custom C++ CPU runtime that treats sparse binary spike states as first-class execution primitives rather than relying solely on post-hoc weight compression. The runtime combines a manifest-driven weight loader, mixed row/column memory layout, AVX2/FMA kernels, per-channel symmetric INT8 quantization, and integer-domain accumulation for spike-conditioned sparse paths.

On an AMD Ryzen 7 5800X, an early scalar FP32 baseline decoded at 9.5 tokens/s. Mixed-layout AVX2 FP32 raised this to 14.7 tokens/s, and AVX2 INT8 reached 19.9 tokens/s on a step-30k export while reducing the weight footprint from 3.49 GB to 1.06 GB.

For the available 186k-step, 874M-parameter INT8 export, the C++ runtime decoded at 22.63 tokens/s in a single-thread CPU benchmark. This compares favorably with TinyLlama-1.1B Q8_0 at 16.31 tokens/s, Falcon3-1B Q8_0 at 11.26 tokens/s, and Qwen2.5-1.5B Q8_0 at 9.70 tokens/s under llama.cpp. Thread scaling reached 47.90 tokens/s at four CPU threads, and 512-token prefill improved from 29.86 to 94.68 tokens/s from one to eight threads.
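
To make "per-channel symmetric INT8 quantization" and "integer-domain accumulation for spike-conditioned sparse paths" concrete, here is a minimal sketch in plain Python. It is not the paper's C++/AVX2 kernel: the function names `quantize_per_channel` and `spike_matvec`, the 8x64 matrix, and the 10% spike rate are all assumptions chosen for illustration. The idea it shows is that when the input is a binary spike vector, a matrix-vector product reduces to summing the INT8 weights of the active inputs in integers and rescaling once per output channel.

```python
import random

def quantize_per_channel(weights):
    """Symmetric per-row INT8 quantization: weights[r][c] ~= q[r][c] * scales[r]."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        # One scale per output channel; no zero point (symmetric scheme).
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def spike_matvec(q_rows, scales, spikes):
    """Compute W @ s for a binary spike vector s.

    Only columns whose spike is 1 are visited. The running sum stays in
    integers, and each output channel is rescaled to float exactly once.
    """
    active = [i for i, s in enumerate(spikes) if s]
    return [sum(row[i] for i in active) * scale
            for row, scale in zip(q_rows, scales)]

# Hypothetical toy data: 8 output channels, 64 inputs, roughly 10% spiking.
random.seed(0)
weights = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(8)]
spikes = [1 if random.random() < 0.1 else 0 for _ in range(64)]

q_rows, scales = quantize_per_channel(weights)
y_sparse = spike_matvec(q_rows, scales, spikes)

# Dense float reference for comparison.
y_dense = [sum(w * s for w, s in zip(row, spikes)) for row in weights]
```

Because each weight is rounded to the nearest INT8 step, every entry of `y_sparse` differs from `y_dense` by at most half a quantization step per active input. The work done per output also scales with the number of active spikes rather than with the full input width, which is the property a spike-aware runtime can exploit.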

The throughput gains come at a cost in quality: the spiking model reports a WikiText-2 perplexity of 24.80, worse than the dense baselines in the same benchmark. The authors frame the work as an inference-systems study for sparse language runtimes, with longer-term motivation in embodied and edge agents that may benefit from local, low-core inference near sensors and actuators. They acknowledge that model quality, controlled dense training baselines, embodied-task evaluation, and measured CPU energy remain open problems. The paper also offers no comparison with GPU inference, which limits what it can tell practitioners evaluating production deployments. Nonetheless, the approach demonstrates that spike-aware execution can meaningfully improve CPU throughput and memory behavior for spiking language models, which may suit low-resource deployments where latency and memory matter more than accuracy.

Key Points

Single-thread decoding at 22.63 tok/s on the 874M-parameter INT8 model beats TinyLlama-1.1B (16.31 tok/s) and Qwen2.5-1.5B (9.70 tok/s).
Weight footprint drops from 3.49 GB (FP32) to 1.06 GB (INT8), a 3.3x compression.
Decoding scales to 47.90 tok/s at 4 threads; 512-token prefill rises from 29.86 tok/s at 1 thread to 94.68 tok/s at 8 threads.
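
The ratios behind these points can be checked with a few lines of arithmetic. The sketch below uses only the figures reported in this summary. The variable names are illustrative, and it assumes the 4-thread decode figure was measured on the same export as the 22.63 tok/s single-thread figure, which the summary does not state explicitly.

```python
# Decode throughput in tokens/s, as reported in this summary.
baseline_name = "SymbolicLight 874M INT8"
decode_tok_s = {
    baseline_name: 22.63,
    "TinyLlama-1.1B Q8_0": 16.31,
    "Falcon3-1B Q8_0": 11.26,
    "Qwen2.5-1.5B Q8_0": 9.70,
}

# Single-thread speedup of the spiking runtime over each llama.cpp baseline.
ours = decode_tok_s[baseline_name]
speedups = {name: ours / v for name, v in decode_tok_s.items()
            if name != baseline_name}

# Weight footprint: FP32 size divided by INT8 size (both in GB).
compression = 3.49 / 1.06

# Parallel efficiency = (multi-thread rate / single-thread rate) / threads.
decode_scaling_4t = 47.90 / 22.63      # assumption: same export for both runs
decode_efficiency_4t = decode_scaling_4t / 4
prefill_scaling_8t = 94.68 / 29.86
prefill_efficiency_8t = prefill_scaling_8t / 8

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
print(f"compression: {compression:.2f}x")
print(f"decode, 4 threads: {decode_scaling_4t:.2f}x ({decode_efficiency_4t:.0%} efficiency)")
print(f"prefill, 8 threads: {prefill_scaling_8t:.2f}x ({prefill_efficiency_8t:.0%} efficiency)")
```

The single-thread speedups come out at roughly 1.39x over TinyLlama-1.1B, 2.01x over Falcon3-1B, and 2.33x over Qwen2.5-1.5B, so the headline "2x" holds for two of the three baselines. Scaling is sublinear: about 2.1x at four threads for decoding and about 3.2x at eight threads for prefill.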

Why It Matters

Spike-aware inference could bring fast, local language model execution to commodity CPUs, a step toward edge and embedded AI agents, provided the gap in model quality can be closed.
