Open Source

Cactus Compute's Needle runs tool calling at 6K tok/s with only 26M params

r/LocalLLaMA May 13, 2026

⚡ No MLPs, just attention – a 26M model beating 270M+ rivals on function calling.

Deep Dive

Cactus Compute has open-sourced Needle, a 26M-parameter function-calling model. It reaches 6000 tok/s prefill and 1200 tok/s decode on consumer devices. The architecture uses only attention and gating (no MLPs), which suits on-device agents on phones, watches, and glasses. Needle was pre-trained on 200B tokens and post-trained on 2B tokens of synthesized tool-calling data generated with Gemini. It outperforms FunctionGemma-270M, Qwen-0.6B, and others on single-shot function calling. Those larger models, however, cover a broader range of tasks and do better in conversational settings.
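To make "attention and gating, no MLPs" concrete, here is a minimal sketch of what such a block could look like. It is not Needle's actual architecture, which the summary above does not spell out. The single head, the sigmoid gate, the residual connection, and every name here (`gated_attention_block`, `Wq`, `Wk`, `Wv`, `Wg`) are assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_attention_block(X, Wq, Wk, Wv, Wg):
    """Hypothetical single-head block: causal attention, then an
    element-wise sigmoid gate, then a residual add. No MLP sublayer.
    X is a list of token vectors (seq_len x d); output has the same shape."""
    d = len(X[0])
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    out = []
    for t, x in enumerate(X):
        # Causal mask: token t attends only to positions 0..t.
        scores = [sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        attended = [sum(w * V[s][i] for s, w in enumerate(weights))
                    for i in range(d)]
        # Gate computed from the token itself (an assumed design choice).
        gate = [sigmoid(g) for g in matvec(Wg, x)]
        out.append([xi + gi * ai for xi, gi, ai in zip(x, gate, attended)])
    return out
```

In a conventional transformer block, the attention sublayer would be followed by a feed-forward MLP. In this sketch the gate is the only extra nonlinearity, which is one way a model could stay small.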

Key Points

26M parameters, 0 MLPs – uses only attention and gating (Simple Attention Networks).
Runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
Outperforms FunctionGemma-270M and Qwen-0.6B on single-shot function calling tasks.
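As a rough illustration of what those throughput figures mean for a single tool call, the sketch below turns them into a latency estimate. The prompt and output sizes are made-up example values. The estimate assumes the quoted rates are sustained and ignores tokenization, model loading, and other overhead.

```python
# Throughput figures quoted in the article (tokens per second).
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200

def estimate_latency_s(prompt_tokens, output_tokens):
    """Back-of-the-envelope latency: prefill time plus decode time."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S

# Hypothetical request: 600 tokens of tool schemas and query, 60-token call.
print(f"{estimate_latency_s(600, 60) * 1000:.0f} ms")  # prints "150 ms"
```

Under these assumptions, a request of that size would finish in well under a second.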

Why It Matters

Enables real-time, privacy-preserving AI agents on phones, watches, and IoT devices without cloud dependency.
