Dynamic batching for Encoder-Decoder MT training or generation when long sequences cap the batch size [P]
Dynamic batching adapts the batch size to sequence length, reducing OOM errors and increasing throughput.
A developer named bendangnuksung created dynabatch, a custom PyTorch sampler, to solve a common bottleneck in encoder-decoder model training: fixed batch sizes limited by the longest sequences. While fine-tuning an NLLB-200 600M model on an RTX 5090, the maximum fixed batch size was just 8 before hitting OOM errors. Monitoring with nvidia-smi revealed that most batches underutilized the GPU, since only the longest source/target pairs actually stressed memory. Dynabatch addresses this by sorting examples by token length (longest first), treating the first batch as a memory baseline, then using an XGBoost regressor to predict memory pressure for larger candidate batch sizes on shorter sequences, selecting the largest size that stays under a safety threshold. This approach yielded a 3.3x throughput improvement in training benchmarks on the RTX 5090, though gains were smaller on a T4 GPU (1.06x-1.21x) for generation tasks. The tool includes a fallback for cases where the regressor overestimates what fits and causes an OOM, and the training notebooks are open-sourced for transparency. Designed primarily for encoder-decoder models like NLLB-200 for machine translation, where source length is a proxy for target length, dynabatch is less suited to decoder-only models, where sequence packing is preferred. The repo is on GitHub and PyPI.
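To make the mechanism concrete, here is a minimal, self-contained sketch of length-sorted dynamic batching in the spirit described above. It is not dynabatch's actual API: the class name `DynamicBatchSampler`, the `record`/`_predict_mem` helpers, and the use of scikit-learn's `LinearRegression` (as a stand-in for the XGBoost regressor) are all illustrative assumptions.

```python
# Hypothetical sketch of length-sorted dynamic batching, NOT dynabatch's API.
from typing import Iterator, List, Sequence, Tuple

import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for an XGBoost regressor


class DynamicBatchSampler:
    """Yield index batches whose size grows as sequences get shorter.

    Examples are sorted longest-first, so the first (longest) batch acts as a
    memory baseline. A regressor fit on observed (batch_size, max_len) -> memory
    points then predicts how large a batch of the next, shorter examples can be
    while staying under a safety fraction of that baseline.
    """

    def __init__(self, lengths: Sequence[int], base_batch_size: int = 8,
                 mem_budget: float = 0.9):
        self.lengths = list(lengths)
        self.base_batch_size = base_batch_size
        self.mem_budget = mem_budget                      # fraction of baseline memory
        self.model = LinearRegression()
        self._obs: List[Tuple[int, int, float]] = []      # (batch_size, max_len, mem_used)

    def record(self, batch_size: int, max_len: int, mem_used: float) -> None:
        """Feed back measured memory, e.g. torch.cuda.max_memory_allocated()."""
        self._obs.append((batch_size, max_len, mem_used))
        X = np.array([[b, l, b * l] for b, l, _ in self._obs])
        y = np.array([m for *_, m in self._obs])
        self.model.fit(X, y)

    def _predict_mem(self, batch_size: int, max_len: int) -> float:
        x = np.array([[batch_size, max_len, batch_size * max_len]])
        return float(self.model.predict(x)[0])

    def __iter__(self) -> Iterator[List[int]]:
        # Longest sequences first: the first batch sets the memory baseline,
        # every later batch only contains shorter sequences.
        order = sorted(range(len(self.lengths)), key=lambda i: -self.lengths[i])
        i = 0
        while i < len(order):
            max_len = self.lengths[order[i]]
            size = self.base_batch_size
            if self._obs:
                baseline_mem = self._obs[0][2]
                # Grow the candidate size while predicted memory stays under budget.
                while (i + size * 2 <= len(order)
                       and self._predict_mem(size * 2, max_len)
                       < self.mem_budget * baseline_mem):
                    size *= 2
            yield order[i:i + size]
            i += size
```

In a training loop, one would measure peak memory after each step and call `record(...)` so the regressor's predictions improve as more (batch size, length, memory) points accumulate.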
- Dynabatch dynamically adjusts batch size based on sequence length, boosting throughput 3.3x on RTX 5090 for NLLB-200 600M training.
- Uses an XGBoost regressor to predict GPU memory pressure, with a fallback for OOM overestimates (sketched after this list).
- Optimized for encoder-decoder MT models; less effective on decoder-only models or T4 GPUs (1.06x-1.21x gain).
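The OOM fallback can be illustrated with a short, hedged sketch: when the regressor overestimates what fits, the step catches the CUDA OOM, frees cached memory, and retries on smaller pieces. The helper names (`safe_train_step`, `train_step`, `split_batch`) are hypothetical, not dynabatch's interface.

```python
# Illustrative OOM fallback, not dynabatch's actual implementation.
import torch


def safe_train_step(model, batch, train_step, split_batch):
    """Run one step; on CUDA OOM, free cached memory and retry on two halves."""
    try:
        return train_step(model, batch)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        first, second = split_batch(batch)  # e.g. split tensors along the batch dim
        return train_step(model, first) + train_step(model, second)
```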
Why It Matters
Enables faster fine-tuning of encoder-decoder models on consumer GPUs, reducing OOM errors and improving resource utilization for MT tasks.