Developer Tools

PyTorch patches oneDNN convolution regression, 2x speedup for dense weights

MAML Omniglot shape sees 284→132 us after prop kind fix

Deep Dive

A recent PyTorch commit (28b4992e) aimed to speed up channels-last and prepacked-weight inference by requesting the oneDNN forward_inference prop kind for mkldnn_convolution_pointwise. That change inadvertently regressed performance for dense contiguous runtime weights, which are common in dynamic-shape Inductor workloads such as MAML Omniglot. For dense inputs, oneDNN selected a brg_conv_fwd inference primitive with an NHWC-like layout, so PyTorch had to convert the results back to a dense layout. That extra conversion made the path much slower than the previous forward/forward_training path.

The fix narrows the optimization: forward_inference is now used only for channels-last/MKLDNN-layout paths, where the selected primitive matches the user layout. Dense contiguous pointwise convolution falls back to the default forward prop kind (forward_training), restoring pre-regression speeds. In the reported benchmark, the contiguous no_grad case dropped from 284.6 us to 132.2 us, while channels-last remained fast at ~95 us. The patch also passes the relevant fusion tests and avoids a broad revert that would have given up the original inference win. Because it targets the layout mismatch itself, it fixes the root cause without needing a shape denylist.
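The narrowed selection rule can be sketched in a few lines of plain Python. This is an illustrative model only, not the code from the patch: the PropKind and Layout enums and the select_prop_kind function are hypothetical names, and the exact condition the patch checks is an assumption drawn from the description above.

```python
from enum import Enum


class PropKind(Enum):
    """Stand-ins for the two oneDNN forward prop kinds discussed here."""
    FORWARD = "forward"  # default forward kind (the training path)
    FORWARD_INFERENCE = "forward_inference"


class Layout(Enum):
    """Hypothetical layout tags used only by this sketch."""
    DENSE_CONTIGUOUS = "dense_contiguous"
    CHANNELS_LAST = "channels_last"
    MKLDNN = "mkldnn"  # prepacked / opaque oneDNN layout


def select_prop_kind(input_layout: Layout, weight_layout: Layout) -> PropKind:
    """Pick a prop kind so the chosen primitive matches the user layout.

    Assumption: forward_inference is requested only when the input or the
    weight already uses a channels-last or MKLDNN layout. Everything else,
    including dense contiguous runtime weights, keeps the default forward kind.
    """
    inference_friendly = {Layout.CHANNELS_LAST, Layout.MKLDNN}
    if input_layout in inference_friendly or weight_layout in inference_friendly:
        return PropKind.FORWARD_INFERENCE
    return PropKind.FORWARD


if __name__ == "__main__":
    # Dense contiguous input and weight: falls back to the default forward kind.
    print(select_prop_kind(Layout.DENSE_CONTIGUOUS, Layout.DENSE_CONTIGUOUS))
    # Channels-last input: keeps the inference optimization.
    print(select_prop_kind(Layout.CHANNELS_LAST, Layout.DENSE_CONTIGUOUS))
```

The point of the sketch is that the decision keys on layout rather than on specific tensor shapes, which is why no shape denylist is needed.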

Key Points
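  • Regression introduced by requesting oneDNN forward_inference for every mkldnn_convolution_pointwise layout, including dense contiguous
  • Fix restores the default forward (forward_training) prop kind for dense contiguous weights, about 2.15x faster (284.6 us to 132.2 us)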
  • Channels-last and prepacked inference remain optimized via forward_inference path
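The 2.15x figure follows directly from the two reported timings. A minimal check, using only the 284.6 us and 132.2 us values quoted above (the helper name speedup is just for illustration):

```python
def speedup(before_us: float, after_us: float) -> float:
    """Return how many times faster the 'after' timing is than 'before'."""
    if before_us <= 0 or after_us <= 0:
        raise ValueError("timings must be positive")
    return before_us / after_us


if __name__ == "__main__":
    before_us, after_us = 284.6, 132.2  # contiguous no_grad, as reported
    ratio = speedup(before_us, after_us)
    reduction = 1 - after_us / before_us
    print(f"speedup: {ratio:.2f}x")             # speedup: 2.15x
    print(f"time reduction: {reduction:.1%}")   # time reduction: 53.5%
```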

Why It Matters

PyTorch developers regain fast dense contiguous convolution in dynamic-shape Inductor workloads without giving up the channels-last and prepacked-weight inference gains from the original change.