PyTorch fixes Triton bmm kernel launch on non-current CUDA device
A missing device guard caused vLLM multimodal tests to fail with pointer errors...
PyTorch has patched a bug in its batched matrix multiplication (bmm) outer-product implementation for CUDA. The Python native override, designed for inputs shaped like (B, M, 1) x (B, 1, N), launches a Triton kernel instead of taking the standard C++ ATen CUDA path. Unlike the generated C++ path, this Python dispatch did not apply an automatic CUDA device guard before kernel launch. In vLLM multimodal model hooks, rotary embedding buffers and inputs lived on cuda:1 while the process's current device was cuda:0, so the Triton kernel launched on the wrong device and stream and the launch rejected the pointers with a ValueError.
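To make the shape condition concrete, here is a minimal pure-Python sketch of why a (B, M, 1) x (B, 1, N) bmm is a batched outer product. It uses nested lists rather than tensors, and the helper names `bmm` and `batched_outer` are illustrative only; they are not PyTorch's implementation.

```python
def bmm(a, b):
    """Reference batched matmul on nested lists: (B, M, K) x (B, K, N)."""
    return [
        [
            [sum(a_mat[m][k] * b_mat[k][n] for k in range(len(b_mat)))
             for n in range(len(b_mat[0]))]
            for m in range(len(a_mat))
        ]
        for a_mat, b_mat in zip(a, b)
    ]


def batched_outer(a, b):
    """Shortcut that is only valid when the inner dimension K is 1."""
    return [
        [[a_mat[m][0] * b_mat[0][n] for n in range(len(b_mat[0]))]
         for m in range(len(a_mat))]
        for a_mat, b_mat in zip(a, b)
    ]


a = [[[1.0], [2.0]], [[3.0], [4.0]]]          # shape (2, 2, 1)
b = [[[5.0, 6.0, 7.0]], [[8.0, 9.0, 10.0]]]   # shape (2, 1, 3)

result = batched_outer(a, b)
# With K == 1 the general sum collapses to a single product per output element.
assert result == bmm(a, b)
```

Because each output element needs only one multiplication, a specialized kernel can handle this case, which is presumably why a separate dispatch path exists for it.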
The fix, contributed by jansel and approved by slayton58, enters a device guard only when the input device is not already current. This avoids the overhead of a Python context manager in the common case where the device matches. Benchmarking 1000 repeated bmm calls on (32,128,1) x (32,1,512) tensors showed median times of 24.210 us/call before and 24.946 us/call after, a negligible overhead of about 0.7 us (roughly 3%). The patch also requires both bmm inputs to be on the same CUDA device and falls back to native bmm otherwise. Testing with minimal reproducers and the existing test suite (11 passed) confirmed the fix.
- Missing CUDA device guard in PyTorch's bmm outer-product Triton kernel caused failures when inputs were on a non-current device (e.g., cuda:1 with process on cuda:0).
- Fix applies a device guard only when needed, costing about 0.7 us per call in benchmarks (24.210 us before, 24.946 us after).
- Both bmm inputs must now be on the same CUDA device; mismatched inputs fall back to native bmm with standard error reporting.
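For readers who want to check the overhead figure, the arithmetic from the reported medians is shown below, together with a rough timing helper. The helper `median_us_per_call` is a hypothetical sketch using only the standard library; the source does not describe how its benchmark was actually implemented.

```python
import statistics
import time

# Medians reported in the benchmark, in microseconds per call.
before_us = 24.210
after_us = 24.946

overhead_us = after_us - before_us            # absolute cost of the guard check
overhead_pct = overhead_us / before_us * 100  # relative cost


def median_us_per_call(fn, repeats=1000):
    """Time `fn` `repeats` times and return the median duration in microseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    return statistics.median(samples)


# Usage on a placeholder workload; a real measurement would call bmm here.
sample_median = median_us_per_call(lambda: sum(range(100)))
```

The difference works out to 0.736 us, or about 3% of the original per-call time. For real CUDA measurements the device would also need to be synchronized around the timed region, which this CPU-only helper does not attempt.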
Why It Matters
This fix prevents kernel-launch failures in multi-GPU PyTorch workflows where tensors live on a device other than the current one, especially in complex model orchestrations like vLLM.