Wattlytics: A Web Platform for Co-Optimizing Performance, Energy, and TCO in HPC Clusters

Researchers launch an interactive tool to model GPU cluster costs, showing that under budget or energy constraints, energy-efficient GPUs can be more cost-effective than higher-performance ones.

Deep Dive

A team of researchers has launched Wattlytics, the first interactive, web-based platform designed to co-optimize the performance, energy consumption, and total cost of ownership (TCO) of modern GPU-accelerated high-performance computing (HPC) clusters. Unlike simple procurement calculators, Wattlytics combines benchmark-driven performance data for GPUs such as NVIDIA's GH200, H100, and L40S with power models that account for dynamic voltage and frequency scaling (DVFS). Users can configure heterogeneous systems, select scientific workloads such as GROMACS or AMBER, and model multi-year operating costs under real-world variables such as energy prices and system lifetime.
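To make the idea of a multi-year cost model concrete, here is a minimal sketch of a TCO calculation that adds purchase cost to lifetime energy cost. It is not Wattlytics' actual model: the `GpuOption` class, the `tco` function, the `pue` overhead factor, and all parameter names are assumptions introduced for illustration. The sketch assumes constant average power draw and continuous operation.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760  # non-leap year, assumes the cluster runs continuously


@dataclass(frozen=True)
class GpuOption:
    """Hypothetical description of one GPU type (not a Wattlytics data structure)."""
    name: str
    unit_price: float    # purchase price per GPU, in currency units
    avg_power_w: float   # average power draw under the chosen workload, in watts


def tco(gpu: GpuOption, n_gpus: int, years: float,
        energy_price_per_kwh: float, pue: float = 1.0) -> float:
    """Purchase cost plus energy cost over the system lifetime.

    `pue` is an assumed facility overhead multiplier (1.0 means no overhead).
    """
    capex = n_gpus * gpu.unit_price
    energy_kwh = n_gpus * gpu.avg_power_w / 1000 * HOURS_PER_YEAR * years * pue
    return capex + energy_kwh * energy_price_per_kwh


if __name__ == "__main__":
    demo = GpuOption("demo-gpu", unit_price=1000.0, avg_power_w=500.0)
    print(f"TCO for 2 GPUs over 1 year: {tco(demo, 2, 1, 0.10):.2f}")
```

A real model would also need terms this sketch leaves out, such as host servers, cooling, maintenance, and the way DVFS changes both power draw and throughput.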

The platform's core innovation is its ability to compute multidimensional decision metrics, including work-per-TCO and work-per-watt-per-TCO, enabling direct comparisons between deployment strategies. Users can explore scenarios with fixed budgets, GPU counts, performance targets, or power caps. Case studies within the research paper demonstrate that under budget or energy constraints, optimally deployed energy-efficient GPUs can outperform higher-performance alternatives in overall cost-effectiveness. Wattlytics also includes sensitivity analysis tools like Sobol indices and Monte Carlo simulations to help users identify and manage risk-driving factors in their cluster designs.
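The metrics and the sensitivity idea described above can be sketched in a few lines. The definitions below are one plausible reading, not the paper's formulas: work is taken as throughput times lifetime, work-per-TCO divides that by lifetime cost, and work-per-watt-per-TCO divides again by average power. The `Gpu` class, both GPU entries, and every number are hypothetical. The sensitivity part is plain Monte Carlo sampling over the energy price; it does not compute Sobol indices.

```python
import random
from dataclasses import dataclass

HOURS_PER_YEAR = 8760  # assumes continuous operation


@dataclass(frozen=True)
class Gpu:
    """Hypothetical GPU description; all values used below are made up."""
    name: str
    perf: float          # work units delivered per year at full utilization
    unit_price: float    # purchase price, in currency units
    avg_power_w: float   # average power draw, in watts


def lifetime_tco(gpu: Gpu, years: float, price_per_kwh: float) -> float:
    """Purchase price plus energy cost over the lifetime (per GPU)."""
    energy_kwh = gpu.avg_power_w / 1000 * HOURS_PER_YEAR * years
    return gpu.unit_price + energy_kwh * price_per_kwh


def work_per_tco(gpu: Gpu, years: float, price_per_kwh: float) -> float:
    """Assumed definition: lifetime work divided by lifetime cost."""
    return gpu.perf * years / lifetime_tco(gpu, years, price_per_kwh)


def work_per_watt_per_tco(gpu: Gpu, years: float, price_per_kwh: float) -> float:
    """Assumed definition: work-per-TCO normalized by average power."""
    return work_per_tco(gpu, years, price_per_kwh) / gpu.avg_power_w


def efficient_win_rate(efficient: Gpu, fast: Gpu, years: float,
                       price_low: float, price_high: float,
                       n_samples: int = 10_000, seed: int = 0) -> float:
    """Fraction of sampled energy prices at which `efficient` has the
    higher work-per-TCO. Prices are drawn uniformly from the given range."""
    rng = random.Random(seed)  # seeded for reproducibility
    wins = 0
    for _ in range(n_samples):
        price = rng.uniform(price_low, price_high)
        if work_per_tco(efficient, years, price) > work_per_tco(fast, years, price):
            wins += 1
    return wins / n_samples


# Hypothetical GPU types: one faster and power-hungry, one slower and frugal.
FAST = Gpu("fast", perf=100.0, unit_price=20_000.0, avg_power_w=700.0)
EFFICIENT = Gpu("efficient", perf=60.0, unit_price=13_000.0, avg_power_w=300.0)

if __name__ == "__main__":
    for price in (0.10, 0.40):
        for gpu in (FAST, EFFICIENT):
            print(f"{gpu.name:>9} @ {price:.2f}/kWh: "
                  f"work/TCO = {work_per_tco(gpu, 5, price):.5f}")
    rate = efficient_win_rate(EFFICIENT, FAST, 5, 0.10, 0.40)
    print(f"efficient GPU wins in {rate:.0%} of sampled energy prices")
```

With these made-up numbers and a five-year lifetime, the faster GPU has the better work-per-TCO at low energy prices and the efficient GPU overtakes it at roughly 0.19 per kWh. The example shows only the mechanism by which energy price can flip a ranking; it says nothing about any real GPU.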

By turning complex trade-offs into an interactive, explainable decision-making process, Wattlytics addresses the escalating challenge of balancing computational power with soaring energy costs and sustainability goals in large-scale computing.

Key Points
  • Integrates benchmark-driven GPU performance scaling (e.g., GH200, H100, A100) with DVFS-aware power modeling and multi-year TCO analysis.
  • Enables scenario analysis under constraints like fixed budget or power, showing energy-efficient GPUs can be more cost-effective.
  • Provides sensitivity metrics (Sobol indices, Monte Carlo) and collaborative features for informed, risk-aware HPC cluster procurement.

Why It Matters

Wattlytics helps organizations design cost-effective, sustainable HPC infrastructure by quantifying the trade-offs between performance, energy, and long-term costs.