Developer Tools

Optimizing Effective Training Time for Meta’s Internal Recommendation/Ranking Workloads

Meta's 'Effective Training Time' metric now exceeds 90%, slashing wasted compute in recommendation model training.

Deep Dive

Meta's engineering teams have raised 'Effective Training Time' (ETT%) to over 90% for the company's internal recommendation and ranking workloads. ETT% is the percentage of a job's total time spent consuming new training data, as opposed to overhead such as initialization, checkpointing, and failures. Starting in H2 2024, the team analyzed ETT across the fleet, identified the key bottlenecks, and developed more than 40 targeted technologies to address them.
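To make the definition concrete, here is a minimal Python sketch of how an ETT% figure could be computed from a per-job time breakdown. The class name, field names, and hour values are illustrative assumptions, not Meta's actual accounting or tooling.

```python
from dataclasses import dataclass


@dataclass
class JobTimeBreakdown:
    """Wall-clock hours for one training job (hypothetical field names)."""
    training_on_new_data: float   # time spent consuming new training data
    initialization: float         # setup before the first training step
    checkpointing: float          # time stalled while saving or loading state
    failure_and_recovery: float   # time lost to failures and restarts

    @property
    def total(self) -> float:
        return (self.training_on_new_data + self.initialization
                + self.checkpointing + self.failure_and_recovery)


def ett_percent(job: JobTimeBreakdown) -> float:
    """Share of total job time spent on new training data, in percent."""
    if job.total <= 0:
        raise ValueError("total job time must be positive")
    return 100.0 * job.training_on_new_data / job.total


# Made-up numbers chosen so the total is 100 hours.
job = JobTimeBreakdown(
    training_on_new_data=92.0,
    initialization=3.0,
    checkpointing=2.0,
    failure_and_recovery=3.0,
)
print(f"ETT% = {ett_percent(job):.1f}")  # prints "ETT% = 92.0"
```

In this made-up example, 8 of 100 hours are overhead, so the job sits just above the 90% mark.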

The optimization work focused on four areas: cutting 'Time to Start' (hardware setup, launcher initialization, PyTorch 2 compilation), improving 'Time to Recover' after failures, streamlining checkpoint management to minimize idle time, and reducing the overall failure rate. Specific wins include PyTorch 2 compilation optimizations that cut both compile time and recompilation, and a shift to CPU machines instead of GPUs for model publishing, which saves GPU hours. Many of these improvements are available in open-source projects such as TorchRec and PyTorch. They address a common industry pain point: as workloads scale, infrastructure overhead takes up a growing share of total job time.
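The four areas can be read as four separate levers on the same ratio. The sketch below is a toy model, not Meta's methodology: it assumes overhead is simply start time, plus failures times recovery time, plus checkpoints times stall per checkpoint, and it ignores work lost between the last checkpoint and a failure. All parameter names and numbers are hypothetical.

```python
def ett_percent(productive_h, time_to_start_h, expected_failures,
                time_to_recover_h, checkpoints, stall_per_checkpoint_h):
    """Toy model: ETT% = productive time / (productive time + overhead)."""
    overhead_h = (
        time_to_start_h                           # Time to Start
        + expected_failures * time_to_recover_h   # failure rate x Time to Recover
        + checkpoints * stall_per_checkpoint_h    # checkpoint-induced idle time
    )
    return 100.0 * productive_h / (productive_h + overhead_h)


# Hypothetical baseline job; none of these values come from the article.
baseline = {
    "productive_h": 100.0,
    "time_to_start_h": 4.0,
    "expected_failures": 5.0,
    "time_to_recover_h": 2.0,
    "checkpoints": 50,
    "stall_per_checkpoint_h": 0.1,
}

# Each scenario overrides one lever (or all four) relative to the baseline.
scenarios = {
    "baseline": {},
    "halve time to start": {"time_to_start_h": 2.0},
    "halve time to recover": {"time_to_recover_h": 1.0},
    "halve checkpoint stall": {"stall_per_checkpoint_h": 0.05},
    "halve failure rate": {"expected_failures": 2.5},
    "all four together": {
        "time_to_start_h": 2.0,
        "time_to_recover_h": 1.0,
        "stall_per_checkpoint_h": 0.05,
        "expected_failures": 2.5,
    },
}

results = {
    name: ett_percent(**{**baseline, **overrides})
    for name, overrides in scenarios.items()
}

for name, value in results.items():
    print(f"{name:<24} ETT% = {value:.1f}")
```

In this toy setup no single lever lifts the baseline past 90%, while the combined scenario does, which illustrates why attacking all four areas together can matter.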

Key Points
  • Meta's 'Effective Training Time' (ETT%) metric now exceeds 90% for offline training, a major efficiency milestone.
  • The team developed over 40 technologies targeting four areas: Time to Start, Time to Recover, checkpoint management, and failure reduction.
  • Key optimizations include PyTorch 2 compilation improvements and using CPUs instead of GPUs for model publishing, saving GPU hours.

Why It Matters

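For companies scaling AI training, overhead outside of actual training takes up a growing share of job time, so reducing it is critical for ROI. Meta's approach offers a blueprint for cutting wasted compute and cost.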