6-GPU multiplexer from K80s: hot-swap between models in 0.3ms
A custom kernel module switches between loaded models in just 0.3 milliseconds using repurposed K80 GPUs.
An independent developer, leveraging skills honed on the Boot AI project, has engineered a novel system that multiplexes six GPU dies through a single PCIe slot. The hardware core consists of a repurposed BTC-S37 cryptocurrency mining motherboard and three NVIDIA K80 dual-GPU cards, providing a total of 72GB of VRAM (12GB per die) for approximately $200. The breakthrough is a custom Linux kernel module that lets the system hot-swap between fully loaded AI models in an average of just 0.3 milliseconds, with testing showing zero degradation over ten rapid swap cycles. Each of the six GPU dies can hold a different model persistently in memory.
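The idea of keeping one model resident per die and switching only the routing can be sketched in a few lines of C. This is an illustrative user-space model of the bookkeeping, not the project's kernel module: the names gpu_mux, die_slot, mux_load and mux_switch are hypothetical, and no real GPU or PCIe calls are made. The point it shows is that a switch is cheap when nothing has to be reloaded, because only the index of the active die changes.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_DIES 6          /* three K80 cards, two dies each */
#define VRAM_PER_DIE_GB 12  /* 72GB total across six dies */

/* Hypothetical bookkeeping for one GPU die (not the project's real layout). */
struct die_slot {
    const char *model_name; /* NULL while no model is resident on this die */
};

/* Hypothetical multiplexer state: six slots plus the currently routed die. */
struct gpu_mux {
    struct die_slot slots[NUM_DIES];
    int active;             /* index of the routed die, or -1 if none */
};

/* Start with every die empty and nothing routed. */
void mux_init(struct gpu_mux *m) {
    for (int i = 0; i < NUM_DIES; i++)
        m->slots[i].model_name = NULL;
    m->active = -1;
}

/* Record that a model is resident on a die. Returns 0 on success, -1 on bad input. */
int mux_load(struct gpu_mux *m, int die, const char *name) {
    if (die < 0 || die >= NUM_DIES || name == NULL)
        return -1;
    m->slots[die].model_name = name;
    return 0;
}

/* Route to a die that already holds a model. Only the index changes;
 * no weights move, which is why a swap can be fast. */
int mux_switch(struct gpu_mux *m, int die) {
    if (die < 0 || die >= NUM_DIES)
        return -1;
    if (m->slots[die].model_name == NULL)
        return -1;          /* refuse to route to an empty die */
    m->active = die;
    return 0;
}

/* Name of the model on the routed die, or NULL if nothing is routed. */
const char *mux_active_model(const struct gpu_mux *m) {
    if (m->active < 0)
        return NULL;
    assert(m->active < NUM_DIES);
    return m->slots[m->active].model_name;
}

/* Total VRAM across all dies, in GB. */
int mux_total_vram_gb(void) {
    return NUM_DIES * VRAM_PER_DIE_GB;
}
```

A caller would run mux_init once, call mux_load for each die as its model is placed in VRAM, and then call mux_switch whenever a request needs a different model. In the real system the switch presumably also reroutes the PCIe path, which this sketch does not attempt to model.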
The inference engine is written in pure C with zero Python dependencies, emphasizing performance and low-level control. In initial tests, the setup achieved a decode rate of 38 tokens per second with a quantized RWKV-X 0.2B model. The project's goal is to eventually fill all eight slots on the mining board, creating an ultra-low-cost, multi-model inference server where models can be loaded and switched at will. This hack demonstrates a significant proof-of-concept for maximizing the utility of obsolete, high-VRAM hardware like mining rigs and server-grade K80s, which are otherwise challenging to utilize efficiently.
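To put the two reported figures side by side, the short C sketch below converts the decode rate into time per token and expresses the swap latency as a fraction of that time. The function names are hypothetical and the inputs are simply the numbers quoted above (38 tokens per second, 0.3 ms per swap). It is back-of-the-envelope arithmetic, not a measurement from the project.

```c
/* Milliseconds spent generating one token at a given decode rate. */
double ms_per_token(double tokens_per_second) {
    return 1000.0 / tokens_per_second;
}

/* Swap latency expressed as a fraction of one token's decode time. */
double swap_fraction_of_token(double swap_ms, double tokens_per_second) {
    return swap_ms / ms_per_token(tokens_per_second);
}
```

With the article's numbers, ms_per_token(38.0) is roughly 26.3 ms, and swap_fraction_of_token(0.3, 38.0) is roughly 0.011. Under these figures, one model switch costs about 1% of the time needed to decode a single token.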
- Built with $200 of repurposed hardware: a BTC-S37 mining motherboard and 3 NVIDIA K80 cards (6 dies, 72GB VRAM).
- Custom Linux kernel module enables 0.3ms model switching with persistent model storage on each GPU die.
- Pure C inference engine delivers 38 tok/s on a test model, offering a blueprint for budget multi-model serving.
Why It Matters
It provides a blueprint for affordable, high-availability AI inference, making multi-model serving accessible with repurposed hardware.