Open Source

Liquid AI's LFM2-24B-A2B running at ~50 tokens/second in a web browser on WebGPU

A 24-billion-parameter MoE model runs locally in your browser at desktop speeds, with no server needed.

Deep Dive

Liquid AI has made a significant leap in on-device AI by demonstrating its LFM2 family of Mixture-of-Experts (MoE) models running entirely in a web browser. The showcase features two models: LFM2-24B-A2B, with 24 billion total parameters (2 billion active per token), and a smaller LFM2-8B-A1B variant. On an Apple M4 Max laptop, the 24B model generates text at approximately 50 tokens per second, while the 8B model exceeds 100 tokens per second. This performance is achieved using WebGPU, a modern web standard that gives JavaScript access to the machine's GPU, removing the need for a local server setup or specialized AI software.
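Because WebGPU support still varies across browsers, any demo like this has to feature-detect it before downloading weights. A minimal sketch of that check in TypeScript follows (standard WebGPU API; the inline type stands in for the full @webgpu/types declarations):

```ts
// Feature-detect WebGPU before loading a model: navigator.gpu only exists
// in browsers with WebGPU enabled, and requestAdapter() can still return
// null when no suitable GPU is available.
async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as { gpu?: { requestAdapter(): Promise<unknown> } }).gpu;
  if (!gpu) return false;
  const adapter = await gpu.requestAdapter();
  return adapter !== null;
}

if (await hasWebGPU()) {
  console.log("WebGPU available: run the model on the GPU");
} else {
  console.log("No WebGPU: fall back to WASM or a smaller model");
}
```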

The company has released the demo's source code and pre-optimized ONNX models on Hugging Face, allowing developers to replicate the setup. ONNX (Open Neural Network Exchange) is an open format for representing machine-learning models so they can run across different frameworks and runtimes, and it is what lets these models execute efficiently inside the browser. This approach fundamentally shifts how users can access powerful language models, moving them from remote cloud servers into a private, local browser session. It opens the door to applications requiring low latency, complete data privacy, and instant accessibility without installation.
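Replicating this kind of setup takes only a few lines with Hugging Face's Transformers.js library, which executes ONNX models in the browser (via ONNX Runtime Web) and can target WebGPU. A minimal sketch follows; the model id below is a placeholder, so substitute the actual LFM2 ONNX repository Liquid AI published:

```ts
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Placeholder id — point this at the LFM2 ONNX repo released by Liquid AI.
const MODEL_ID = "onnx-community/LFM2-8B-A1B-ONNX";

// Download the quantized ONNX weights (cached by the browser after the
// first load) and compile the model for the GPU through WebGPU.
const generator = await pipeline("text-generation", MODEL_ID, {
  device: "webgpu",
  dtype: "q4", // 4-bit weights keep the download and GPU memory footprint small
});

// Stream tokens to the console as they are generated.
const streamer = new TextStreamer(generator.tokenizer, { skip_prompt: true });
await generator(
  [{ role: "user", content: "Summarize WebGPU in one sentence." }],
  { max_new_tokens: 128, streamer },
);
```

Because everything above runs client-side, the prompt and the generated text never leave the machine, which is what makes the privacy claim credible.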

Key Points
  • The LFM2-24B-A2B MoE model runs at ~50 tokens/sec in-browser on an M4 Max via WebGPU (a rough way to measure throughput yourself is sketched after this list).
  • The smaller 8B variant exceeds 100 tokens/sec, showing the family scales down to less capable hardware.
  • Optimized ONNX models are available on Hugging Face, enabling local, private, serverless AI inference.
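
To sanity-check throughput figures like these on your own hardware, you can time the decode loop with the streamer's per-token callback. A sketch, again assuming the placeholder model id from above; the timer starts at the first token so prompt prefill is excluded:

```ts
import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "onnx-community/LFM2-8B-A1B-ONNX", {
  device: "webgpu",
  dtype: "q4",
});

// token_callback_function fires once per generated token, so counting calls
// against wall-clock time approximates decode throughput.
let tokens = 0;
let start = 0;
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  token_callback_function: () => {
    if (tokens === 0) start = performance.now(); // exclude prefill time
    tokens++;
  },
});

await generator("Write a haiku about GPUs.", { max_new_tokens: 256, streamer });
const seconds = (performance.now() - start) / 1000;
console.log(`~${(tokens / seconds).toFixed(1)} tokens/sec`);
```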

Why It Matters

This brings powerful, private AI directly to users' devices, eliminating cloud dependency and latency for sensitive or real-time tasks.