New OFU metric gives instant GPU efficiency visibility at fleet scale

arXiv cs.DC May 21, 2026

⚡ Zero instrumentation, MFU predicted to within 2 percentage points, and a 2.5x efficiency regression caught across large GPU fleets.

Deep Dive

A team of researchers (Pedersen et al.) has introduced Overall FLOP Utilization (OFU), a lightweight, hardware-level metric that lets operators track GPU efficiency across large AI fleets without code changes or instrumentation. OFU is derived from just two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. Because it relies only on these counters, it is precision-agnostic and works across GPU generations from H100 to GB200, covering numeric formats from FP16 through NVFP4.
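To make the two-counter idea concrete, here is a minimal sketch of how such a metric could be assembled. The formula is an assumption for illustration, not the paper's stated definition: it scales the fraction of cycles in which the tensor pipe is active by the ratio of the observed SM clock to a reference clock at which the peak FLOP rating is assumed to hold. The function name `estimate_ofu` and the sample values are hypothetical.

```python
def estimate_ofu(tensor_pipe_activity: float,
                 sm_clock_mhz: float,
                 ref_clock_mhz: float) -> float:
    """Estimate FLOP utilization as a fraction of peak.

    ASSUMED formula (not taken from the paper):
        utilization = tensor pipe activity * (observed SM clock / reference clock)

    tensor_pipe_activity: fraction of cycles the tensor pipe was active, in [0, 1]
    sm_clock_mhz:         observed SM clock over the same sampling window
    ref_clock_mhz:        clock at which the peak FLOP rating is assumed to apply
    """
    if not 0.0 <= tensor_pipe_activity <= 1.0:
        raise ValueError("tensor_pipe_activity must be in [0, 1]")
    if sm_clock_mhz < 0 or ref_clock_mhz <= 0:
        raise ValueError("clock values must be positive")
    return tensor_pipe_activity * (sm_clock_mhz / ref_clock_mhz)


# Hypothetical sample: tensor pipe busy 60% of cycles while the SM clock
# runs at 1500 MHz against an assumed 2000 MHz reference clock.
sample_ofu = estimate_ofu(0.60, 1500.0, 2000.0)
print(f"estimated utilization: {sample_ofu:.1%}")
```

Neither input depends on the numeric format or on what the application reports, which is consistent with the article's claim that the metric needs no application changes.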

In controlled GEMM benchmarks, the team showed that after a simple tile-quantization correction, OFU predicts application-level Model FLOP Utilization (MFU) to within 2 percentage points. Against 608 real-world training jobs, OFU correlated with MFU at r=0.78 and surfaced two cases where frameworks were miscalculating FLOPs. In live deployment across large-scale GPU fleets, OFU caught a 2.5x efficiency regression and tracked precision-dependent utilization shifts during mixed-precision pretraining. The researchers argue that OFU is a practical, deployment-ready complement to MFU for continuous, fleet-wide efficiency monitoring.
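This summary does not give the exact form of the tile-quantization correction, so the following is only a sketch of the general effect under stated assumptions. GEMM kernels compute their output in fixed-size tiles; when an M x N output is not a multiple of the tile shape, the last tiles are padded, and the tensor pipe does work that an application-level FLOP count does not credit. The tile shape (128 x 256) and the function names are hypothetical.

```python
def tile_efficiency(m: int, n: int, tile_m: int = 128, tile_n: int = 256) -> float:
    """Fraction of tile work that is useful for an m x n GEMM output.

    The output is assumed to be covered by whole tile_m x tile_n tiles,
    so each dimension is rounded up to the next tile multiple.
    """
    if min(m, n, tile_m, tile_n) <= 0:
        raise ValueError("all dimensions must be positive")
    padded_m = -(-m // tile_m) * tile_m  # integer ceiling to a tile multiple
    padded_n = -(-n // tile_n) * tile_n
    return (m * n) / (padded_m * padded_n)


def corrected_utilization(raw_utilization: float, m: int, n: int,
                          tile_m: int = 128, tile_n: int = 256) -> float:
    """Scale a counter-derived utilization down to useful work only (assumed form)."""
    return raw_utilization * tile_efficiency(m, n, tile_m, tile_n)


# An output that fits the tiles exactly needs no correction.
aligned = tile_efficiency(128, 256)

# One extra row forces a second, almost empty row of tiles.
misaligned = tile_efficiency(129, 256)

print(aligned, misaligned, corrected_utilization(0.5, 129, 256))
```

In this toy case a single extra row nearly halves the useful fraction, which shows why an uncorrected counter-based metric can read well above MFU on awkward GEMM shapes.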

Key Points

OFU uses only two hardware counters (Tensor Pipe Activity and SM clock frequency), so no application changes are needed
After tile-quantization correction, it predicts MFU to within 2 percentage points (tested on H100/GB200 across FP16, TF32, FP8, NVFP4)
In production fleets it detected a 2.5x efficiency regression; comparison with MFU on real training jobs revealed two framework FLOPs miscalculations

Why It Matters

OFU lets operators monitor GPU efficiency continuously at fleet scale without instrumentation and catch regressions quickly.
