Research & Papers

Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

New method splits Mistral 7B across local/cloud GPUs, achieving 9.3 tokens/sec over 80ms WAN links.

Deep Dive

Michael Cunningham presents a privacy-aware split inference system for LLMs. It divides transformer layers between a trusted local device and an untrusted cloud, using speculative decoding to overcome WAN latency. The method achieves 8.7-9.3 tok/s on Mistral 7B over 80ms links while keeping raw tokens on the local device. With an 8-layer local split, the success rate of token-recovery attacks drops to ~35%. The system requires only 4.9GB of local VRAM for 12B models while matching 7B throughput.
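The core latency trick can be illustrated with a toy simulation (an assumption-laden sketch, not the paper's code): a cheap local "draft" model proposes a batch of k tokens, and the remote "target" model verifies the whole batch in a single WAN round trip, so each 80ms exchange can yield several accepted tokens instead of one. The `target_next`/`draft_next` rules below are hypothetical stand-ins for real model calls.

```python
# Toy speculative decoding over a split link (sketch; models are
# stand-in deterministic rules, not actual LLM forward passes).

def target_next(seq):
    # hypothetical cloud-side "target" model: deterministic toy rule
    return (sum(seq) * 31 + 7) % 50

def draft_next(seq):
    # hypothetical local "draft" model: agrees with the target on
    # most steps, diverges on every fourth context length
    t = target_next(seq)
    return t if len(seq) % 4 else (t + 1) % 50

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    round_trips = 0
    while len(seq) - len(prompt) < n_tokens:
        # 1. draft k tokens locally -- no network traffic
        draft, ctx = [], list(seq)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. one WAN round trip: the cloud verifies the whole draft
        round_trips += 1
        ctx = list(seq)
        for tok in draft:
            correct = target_next(ctx)
            ctx.append(correct)
            if tok != correct:
                break  # reject the rest; keep the target's correction
        seq = ctx
    return seq[len(prompt):len(prompt) + n_tokens], round_trips

tokens, trips = speculative_decode([1, 2, 3], 12, k=4)
print(len(tokens), trips)  # generates 12 tokens in fewer than 12 round trips
```

Because each rejected draft token is replaced by the target's own prediction, every round trip still advances the sequence by at least one token, and by up to k when the draft agrees, which is what amortizes the 80ms link latency.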

Why It Matters

Enables enterprises to use powerful cloud LLMs for sensitive data without sacrificing performance or privacy.