Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks
A new method splits Mistral 7B across local and cloud GPUs, achieving up to 9.3 tokens/sec over 80 ms WAN links.
Deep Dive
Michael Cunningham presents a privacy-aware split inference system for LLMs. It divides transformer layers between a trusted local device and an untrusted cloud, using speculative decoding to overcome WAN latency. The method achieves 8.7-9.3 tok/s on Mistral 7B over 80 ms links while keeping raw tokens on the local device. With an 8-layer local split, token-recovery attack success drops to roughly 35%. It requires only 4.9 GB of local VRAM for 12B models while matching 7B-class throughput.
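The throughput claim hinges on speculative decoding amortizing the WAN round trip: the local side drafts several tokens per round, and the cloud verifies the whole batch in one exchange, so fewer round trips are paid per emitted token. A minimal back-of-the-envelope model of that trade-off is sketched below; all parameters (RTT, draft and verify latencies, acceptance rate) are illustrative assumptions, not figures from the paper:

```python
def speculative_throughput(rtt_s, draft_s, verify_s, k, accept_p):
    """Rough tokens/sec estimate for speculative decoding over a WAN link.

    Assumes i.i.d. per-token acceptance with probability accept_p, giving
    expected tokens per round = (1 - accept_p**(k+1)) / (1 - accept_p):
    the accepted draft tokens plus one token from the verifier itself.
    Illustrative model only -- not the paper's implementation.
    """
    tokens_per_round = (1 - accept_p ** (k + 1)) / (1 - accept_p)
    # Each round: k local draft steps, one cloud verify, one WAN round trip.
    round_time = k * draft_s + verify_s + rtt_s
    return tokens_per_round / round_time

# Hypothetical numbers: 80 ms RTT, 5 ms/token local draft, 30 ms cloud verify.
baseline = speculative_throughput(0.080, 0.005, 0.030, k=0, accept_p=0.8)
spec = speculative_throughput(0.080, 0.005, 0.030, k=8, accept_p=0.8)
print(f"baseline: {baseline:.1f} tok/s, speculative: {spec:.1f} tok/s")
```

With k=0 (no drafting) every token pays a full round trip, capping throughput near 1/RTT; drafting 8 tokens per round raises the estimate severalfold even under these toy numbers, which is the effect the system exploits. Real deployments would also pay draft-model overhead and activation-transfer costs this sketch ignores.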
Why It Matters
Enables enterprises to run powerful cloud LLMs on sensitive data without sacrificing performance or privacy.