Skymizer Taiwan Inc. Unveils Breakthrough Architecture Enabling Ultra-Large LLM Inference on a Single Card
384GB memory, 240W: run 700B-parameter LLMs locally without massive GPU clusters.
Skymizer Taiwan Inc. unveiled a novel architecture that could reshape enterprise AI inference: a single PCIe card housing six HTX301 chips and 384 GB of memory, capable of running 700B-parameter model inference locally at just ~240W per card. This is a radical departure from current approaches that rely on clusters of high-VRAM GPUs.
The key innovation is splitting the inference pipeline: GPUs handle the compute-bound prefill stage, while the HTX301 card exclusively manages decoding and holds the model weights, the memory-bandwidth-intensive phase that dominates real-world latency. This lets enterprises run massive models without chasing scarce, expensive GPUs. Real-world performance will be demonstrated at Computex in early June.
- Single PCIe card with six HTX301 chips and 384 GB memory enables 700B-parameter LLM inference
- Power consumption is just ~240W per card, far less than multi-GPU setups
- Splits inference: GPU handles prefill, HTX301 handles decode and model weights
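To see why a 700B-parameter model can fit in 384 GB, and why decode speed is governed by memory bandwidth rather than compute, consider a back-of-envelope sketch. The 4-bit quantization assumption and the `bandwidth_gb_per_s` parameter below are illustrative, not figures stated by Skymizer; only the 700B parameter count and 384 GB capacity come from the announcement.

```python
# Back-of-envelope for why decode is memory-bandwidth-bound: generating each
# token requires streaming the entire weight set from memory once.
PARAMS = 700e9            # 700B-parameter model (from the announcement)
BYTES_PER_PARAM = 0.5     # assumed 4-bit quantization; not stated by Skymizer
CARD_MEMORY_GB = 384      # per-card memory (from the announcement)

weight_gb = PARAMS * BYTES_PER_PARAM / 1e9   # 350 GB of weights
assert weight_gb <= CARD_MEMORY_GB           # fits on one card only if quantized

def decode_tokens_per_second(bandwidth_gb_per_s: float) -> float:
    """Rough upper bound on single-stream decode speed: one full weight
    read per token, ignoring KV-cache and activation traffic."""
    return bandwidth_gb_per_s / weight_gb
```

Under these assumptions, decode throughput scales linearly with memory bandwidth and is independent of raw FLOPS, which is why offloading decode to a memory-optimized card while leaving the compute-heavy prefill on the GPU can make sense.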
Why It Matters
Democratizes large model inference by removing the need for massive GPU clusters, cutting cost and power.