Open Source

Skymizer Taiwan Inc. Unveils Breakthrough Architecture Enabling Ultra-Large LLM Inference on a Single Card

384GB memory, 240W: run 700B-parameter LLMs locally without massive GPU clusters.

Deep Dive

Skymizer Taiwan Inc. unveiled a novel architecture that could reshape enterprise AI inference: a single PCIe card housing six HTX301 chips and 384 GB of memory, capable of running inference on 700B-parameter models locally at just ~240W per card. This is a radical departure from current approaches that rely on clusters of high-VRAM GPUs.

The key innovation is splitting the inference pipeline: GPUs handle the compute-dense prefill stage, while the HTX301 card exclusively manages decoding and model weights—the memory-bandwidth-intensive phase that dominates real-world latency. Because the GPU is needed only for prefill, enterprises can run massive models with a far smaller footprint of scarce, expensive high-VRAM GPUs. Real-world performance will be demonstrated at Computex in early June.
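To see why the decode stage is memory-bandwidth-bound rather than compute-bound, a back-of-envelope sketch helps: generating each token requires streaming roughly the entire set of model weights through memory once. The figures below are illustrative assumptions (4-bit quantization, a 10 tokens/s target), not published Skymizer specifications:

```python
# Illustrative sketch: memory traffic during LLM decode.
# Each decoded token reads (approximately) all model weights once,
# so sustained decode speed is gated by memory bandwidth, not FLOPs.

def decode_bandwidth_gbps(params_billion: float,
                          bytes_per_param: float,
                          tokens_per_s: float) -> float:
    """GB/s of weight traffic needed to sustain a given decode rate.
    params_billion * bytes_per_param gives weight size in GB directly."""
    return params_billion * bytes_per_param * tokens_per_s

# Assumption: a 700B-parameter model quantized to 4 bits (0.5 bytes/param).
weights_gb = 700 * 0.5
print(weights_gb)                            # 350.0 GB -> fits in 384 GB

# Assumption: a 10 tokens/s decode target.
print(decode_bandwidth_gbps(700, 0.5, 10))   # 3500.0 GB/s of weight reads
```

Under these assumed numbers, the 384 GB card can hold the quantized weights entirely on-card, which is exactly why offloading decode to dedicated high-capacity memory hardware avoids the multi-GPU sharding that weight storage would otherwise force.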

Key Points
  • Single PCIe card with six HTX301 chips and 384 GB memory enables 700B-parameter LLM inference
  • Power consumption is just ~240W per card, far less than multi-GPU setups
  • Splits inference: GPU handles prefill, HTX301 handles decode and model weights

Why It Matters

Democratizes large-model inference by removing the need for massive GPU clusters, cutting both cost and power.