Research & Papers

New OFU metric gives instant GPU efficiency visibility at fleet scale

Zero instrumentation, within 2 percentage points of MFU on GEMM benchmarks, and a 2.5x efficiency regression caught in large-scale fleet deployment.

Deep Dive

A team of researchers (Pedersen et al.) has introduced Overall FLOP Utilization (OFU), a lightweight, hardware-level metric that gives operators visibility into GPU efficiency across large AI fleets without any code changes or instrumentation. OFU is derived from just two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. Because it depends only on these counters, it is precision-agnostic and backwards-compatible across GPU generations from H100 to GB200, and it covers numeric formats from FP16 through NVFP4.
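This summary does not state the exact formula, so the following is only a plausible sketch of how a utilization figure could be formed from those two counters: the fraction of cycles the tensor pipe is active, scaled by the ratio of the observed SM clock to a reference peak clock. The function name, the `peak_clock_mhz` parameter, and the sample values are assumptions for illustration, not the authors' definition.

```python
def estimate_ofu(tensor_pipe_activity: float,
                 sm_clock_mhz: float,
                 peak_clock_mhz: float) -> float:
    """Hypothetical OFU estimate from two counter readings.

    ASSUMPTION: OFU ~= tensor pipe activity * (observed SM clock / peak SM clock).
    The source only names the two counters; this combination is illustrative.
    """
    if not 0.0 <= tensor_pipe_activity <= 1.0:
        raise ValueError("tensor_pipe_activity must be a fraction in [0, 1]")
    if peak_clock_mhz <= 0 or sm_clock_mhz < 0:
        raise ValueError("clock values must be positive")
    # Running below the peak clock lowers achievable FLOP/s even at full pipe activity.
    return tensor_pipe_activity * (sm_clock_mhz / peak_clock_mhz)


# Made-up sample readings: pipe active 50% of cycles, clock at 75% of peak.
sample_ofu = estimate_ofu(0.5, 1500.0, 2000.0)
print(f"estimated OFU: {sample_ofu:.3f}")
```

Under this assumed form, no per-precision peak FLOP figure is needed, which would be consistent with the metric being described as precision-agnostic.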

In controlled GEMM benchmarks, the team showed that after applying a simple tile-quantization correction, OFU predicts application-level Model FLOP Utilization (MFU) to within 2 percentage points. Against 608 real-world training jobs, OFU achieved a correlation of r = 0.78 with MFU and surfaced two cases where frameworks were miscalculating FLOPs. In live deployment across large-scale GPU fleets, OFU caught a 2.5x efficiency regression and tracked precision-dependent utilization shifts during mixed-precision pretraining. The researchers argue that OFU is a practical, deployment-ready complement to MFU for continuous, fleet-wide efficiency monitoring.
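The summary does not spell out the tile-quantization correction. As a rough illustration of the general idea, a GEMM whose output dimensions are not multiples of the kernel's tile size is padded up to whole tiles, so the tensor pipe stays busy on work that does not count toward useful FLOPs. The sketch below scales a raw counter-based figure by the useful fraction of that padded work. The 128x128 default tile, the function names, and the sample numbers are assumptions, not the paper's procedure.

```python
def tile_quantization_efficiency(m: int, n: int, tile_m: int, tile_n: int) -> float:
    """Fraction of the padded M x N output that is useful work."""
    if min(m, n, tile_m, tile_n) <= 0:
        raise ValueError("all dimensions must be positive")
    # Round each output dimension up to a whole number of tiles.
    padded_m = ((m + tile_m - 1) // tile_m) * tile_m
    padded_n = ((n + tile_n - 1) // tile_n) * tile_n
    return (m * n) / (padded_m * padded_n)


def corrected_ofu(raw_ofu: float, m: int, n: int,
                  tile_m: int = 128, tile_n: int = 128) -> float:
    """ASSUMPTION: discount raw OFU by the share of tile work that is padding."""
    return raw_ofu * tile_quantization_efficiency(m, n, tile_m, tile_n)


# Made-up example: M=192 pads to 256 with 128-row tiles, so 75% of the work is useful.
efficiency = tile_quantization_efficiency(192, 128, 128, 128)
print(f"useful fraction: {efficiency:.2f}")
print(f"corrected OFU:   {corrected_ofu(0.60, 192, 128):.2f}")
```

When both dimensions are exact multiples of the tile size, the useful fraction is 1 and the correction leaves the raw value unchanged.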

Key Points
  • OFU uses only two hardware counters (Tensor Pipe Activity & SM clock frequency) – no app changes needed
  • After correction, predicts MFU within 2 percentage points (tested on H100/GB200 across FP16, TF32, FP8, NVFP4)
  • Deployed on production fleets, detected a 2.5x efficiency regression; comparison against 608 training jobs revealed two framework FLOP miscalculations

Why It Matters

Enables operators to monitor GPU efficiency continuously at scale without instrumentation, catching regressions instantly.