Developer Tools

Expanse (YC P26) uses AI to reclaim wasted GPU cluster capacity

59% of GPU compute wasted in HPC clusters – Expanse's AI predicts actual needs with 8x more accuracy.

Deep Dive

Expanse (YC P26) addresses a massive inefficiency in HPC and AI clusters: users over-request resources by 2–3x to avoid job crashes, resulting in 30–40% effective utilization. In one national-scale HPC cluster, 59% of compute was wasted across 122k jobs—equivalent to $8.5M/month at cloud rates. Built by a team with experience at top quant funds and HPC facilities, Expanse uses deep learning models that ingest job source code, submission scripts, and live hardware telemetry (DCGM, CUPTI, Cgroups) to predict actual resource needs at submission time. The model outperforms historical averages by 34% and frontier LLMs like GPT-4 by 8x on prediction accuracy. It also provides uncertainty estimates (p90 values) so users can choose their risk tolerance.

Expanse integrates natively with SLURM and Kubernetes, requiring no changes to job submission workflows. It offers three capabilities: (1) Resource prediction—predicts GPU VRAM, memory, CPU, and walltime with confidence intervals; (2) Live observability—a dashboard showing real-time hardware telemetry and code stack profiling with low overhead; (3) Failure diagnosis—when a job fails, it correlates stack traces with telemetry to output one- or two-line explanations with code-level fix suggestions. The system fine-tunes cluster-specific models over time, becoming sharper with use. For AI labs, quant firms, and manufacturing, Expanse promises to unlock significant wasted capacity, reduce costs, and improve job reliability.

Key Points
  • 59% of compute wasted in one national HPC cluster (122k jobs, $8.5M/month) due to over-provisioning.
  • Expanse's AI model predicts actual resource needs with 34% better accuracy than baselines and 8x better than GPT-4.
  • Integrates with SLURM/K8s to provide resource predictions, live observability, and failure diagnosis with code-level fixes.

Why It Matters

For AI labs and HPC facilities, Expanse could unlock millions in wasted compute, cutting costs and carbon footprint.