trunk/fd1d1b0aab3da7bbe2fa92e81f5f6235979a6b87: [xpu][fix] Fix test_flash_attention_dynamic on XPU. (#178369)
The fix repairs broken symbolic dimension handling that blocked dynamic flash attention compilation on Intel's XPU hardware.
The PyTorch team has resolved a bug in Intel's XPU backend that broke dynamic compilation for flash attention operations. The issue, documented in commit fd1d1b0, caused `test_flash_attention_dynamic` to fail when using `torch.compile` with `dynamic=True`, the mode that traces input shapes symbolically so a single compiled graph can serve many input sizes. The failure prevented XPU users from benefiting from dynamic graph reuse, forcing a fresh recompilation for every new sequence length.
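To make the scenario concrete, here is a minimal sketch of the pattern the test exercises: compiling `scaled_dot_product_attention` once with `dynamic=True` and reusing it across sequence lengths. The shapes, dtype, and device-selection logic are illustrative assumptions, not taken from the test itself.

```python
import torch
import torch.nn.functional as F

# Fall back to CPU so the sketch runs anywhere; the bug only manifested
# on XPU devices. Shapes are (batch, heads, seq_len, head_dim).
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device == "xpu" else torch.float32

def attention(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)

# dynamic=True asks the compiler to treat the sequence length symbolically,
# so a single compiled graph should serve every iteration below.
compiled = torch.compile(attention, dynamic=True)

for seq_len in (128, 256, 512):
    q = k = v = torch.randn(2, 8, seq_len, 64, device=device, dtype=dtype)
    out = compiled(q, k, v)  # same graph reused for each seq_len
```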
The root cause was threefold. First, the XPU implementation used `.size(3)` instead of `.sym_size(-1)` when checking head dimensions (for a 4-D query tensor, index 3 and index -1 name the same head dimension), which materialized symbolic dimensions into concrete values during FakeTensor tracing. Second, the meta dispatch path lacked an XPU redirection, so backend selection fell through to the slower math backend instead of the optimized XPU kernels. Third, the SDPA constraint system only checked for CUDA tensors, bypassing alignment guards entirely on XPU hardware. The fix aligns XPU's behavior with CUDA's: it handles the head dimension symbolically, adds the missing meta dispatch path, and extends the hardware checks to include XPU.
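One way to observe the symptom is to count how many graphs Dynamo compiles as the sequence length varies. Here is a sketch using `CompileCounter`, an internal PyTorch test utility; the toy function is a stand-in for the attention call, not the real test body.

```python
import torch
from torch._dynamo.testing import CompileCounter

# CompileCounter is usable as a compile backend; its frame_count reports
# how many graphs Dynamo compiled.
counter = CompileCounter()

@torch.compile(backend=counter, dynamic=True)
def scale_by_seq_len(x):
    return x * x.size(-2)  # the sequence length participates symbolically

for seq_len in (64, 128, 256):
    scale_by_seq_len(torch.randn(2, 8, seq_len, 64))

# With symbolic dimensions handled correctly, one dynamic graph is reused
# across all three calls rather than recompiling per shape.
print(counter.frame_count)  # expected: 1
```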
This technical correction enables Intel GPU users to leverage PyTorch's full dynamic compilation capabilities for transformer models. With the fix applied, flash attention operations on XPU can now generate a single dynamic graph that efficiently handles inputs of varying sequence lengths, matching the performance characteristics previously available only on NVIDIA hardware. The commit also removes the `@skipIfXpu` decorator from the relevant test, confirming the issue is resolved.
- Fixed symbolic dimension materialization in XPU's `check_flash_attention_head_dim_size` by switching from `.size(3)` to `.sym_size(-1)`
- Added a missing XPU dispatch path in `_fused_sdp_choice_meta` to ensure proper kernel selection during FakeTensor tracing
- Extended the SDPA constraint guards from `is_cuda` to `is_cuda or is_xpu` so alignment-based recompilation applies on XPU as well (all three changes are sketched below)
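The actual changes live in PyTorch's C++ SDPA utilities and meta registrations; what follows is a hedged, Python-flavored sketch of the three patterns. Every name, bound, and signature here is illustrative rather than copied from the commit.

```python
from enum import IntEnum

import torch

class SDPBackend(IntEnum):
    # Local stand-in for torch.nn.attention.SDPBackend, kept here so the
    # sketch is self-contained.
    MATH = 0
    FLASH_ATTENTION = 1

MAX_FLASH_HEAD_DIM = 256  # assumed bound, for illustration only

def check_flash_attention_head_dim_size(query: torch.Tensor) -> bool:
    # (1) Mirrors the C++ change: read the head dim via its symbolic
    # accessor (sym_size(-1)) so FakeTensor tracing records a guard on the
    # symbol instead of materializing it, as size(3) did. In Python,
    # size(-1) on a fake tensor already yields a SymInt, so it stands in
    # for the C++ sym_size(-1) call here.
    return query.size(-1) <= MAX_FLASH_HEAD_DIM

def fused_sdp_choice_meta(query: torch.Tensor) -> SDPBackend:
    # (2) The meta path now redirects XPU, like CUDA, to real backend
    # selection instead of silently defaulting to the math backend.
    if query.device.type in ("cuda", "xpu"):
        if check_flash_attention_head_dim_size(query):
            return SDPBackend.FLASH_ATTENTION
    return SDPBackend.MATH

def needs_alignment_guard(t: torch.Tensor) -> bool:
    # (3) The constraint guard widened from `t.is_cuda` alone to also
    # cover XPU, so alignment-based recompilation applies on both.
    return t.is_cuda or t.is_xpu
```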
Why It Matters
Enables efficient dynamic compilation for transformer models on Intel GPUs, reducing recompilation overhead and bringing XPU closer to performance parity with CUDA.