Designed a photonic chip for O(1) KV cache block selection — 944x faster, 18,000x less energy than GPU scan at 1M context
A nanophotonics PhD student's optical chip design could eliminate the KV cache bottleneck, using light to select relevant blocks from a 1M-token context in constant time.
A nanophotonics PhD student has proposed a radical hardware solution to one of the biggest bottlenecks in modern AI inference: scanning the Key-Value (KV) cache. The design, called PRISM (Photonic Reconfigurable In-memory Similarity-search Machine), uses light instead of electricity to find relevant data blocks. In current systems like NVIDIA's H100, scanning all block signatures for a 1M-token context takes about 8.5 microseconds per query, an O(N) operation that dominates batch serving latency. PRISM encodes the user's query as light, passively splits it to all N memory blocks simultaneously, and uses microring resonators (MRRs) to compute similarity scores in parallel. This achieves constant-time O(1) selection, regardless of context length.
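To make the bottleneck concrete, here is a minimal sketch (not the project's actual code) of the O(N) block-signature scan that PRISM's optical broadcast would replace: each KV cache block is summarized by a small signature vector, and every decode step scores the query against all N signatures to pick the top-k blocks. The function name, signature dimensions, and cosine-similarity scoring are illustrative assumptions.

```python
import numpy as np

def select_blocks(query_sig: np.ndarray, block_sigs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k blocks most similar to the query.

    query_sig:  (d,) query signature
    block_sigs: (N, d) one signature per KV cache block

    The matrix-vector product below is the O(N) scan a GPU performs
    every decode step; PRISM computes all N scores at once in optics.
    """
    q = query_sig / np.linalg.norm(query_sig)
    B = block_sigs / np.linalg.norm(block_sigs, axis=1, keepdims=True)
    scores = B @ q                            # O(N*d) similarity scan
    topk = np.argpartition(scores, -k)[-k:]   # unordered top-k indices
    return topk[np.argsort(scores[topk])[::-1]]  # sorted best-first

# Toy usage: 8 blocks with 4-dim signatures; the query is a slightly
# perturbed copy of block 3's signature, so block 3 should rank first.
rng = np.random.default_rng(0)
sigs = rng.normal(size=(8, 4))
q = sigs[3] + 0.01 * rng.normal(size=4)
print(select_blocks(q, sigs, k=2))
```

At 1M-token contexts N grows into the tens of thousands of blocks, which is why this per-query scan comes to dominate batch serving latency.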
Simulation results are staggering. At a 1 million token context, PRISM promises selection speeds 944 times faster than a GPU scan while consuming 18,000 times less energy; applied to the ~8.5 microsecond H100 scan, that speedup implies selection in roughly 9 nanoseconds. Projections for a 100 million token context show a 5.3x faster total decode time compared to advanced software methods like Quest, when running models like Qwen2.5-7B. While the photonic chip itself is not yet fabricated—the numbers come from device-physics simulations on thin-film lithium niobate (TFLN)—the project includes a practical, GPU-only block selector that is available today. This software component reportedly achieves 100% retrieval accuracy on needle-in-a-haystack tests with no performance drop on the LongBench-v2 benchmark, offering an immediate tool for developers.
- PRISM uses optical broadcast for O(1) KV cache selection, replacing the O(N) GPU scan that bottlenecks long-context inference.
- Device-physics simulations show 944x faster selection and 18,000x lower energy use at 1M context versus an H100 GPU scan.
- The project includes a working, open-source GPU-only block selector with 100% needle retrieval and no drop on LongBench-v2.
Why It Matters
This breakthrough could make interacting with million-token contexts instantaneous and energy-efficient, unlocking practical long-context AI for enterprise and consumer applications.