Dynamic batching for Encoder-Decoder MT training or generation when long sequences cap the batch size [P]
Dynamic batching adapts the batch size to sequence length, reducing OOM errors and increasing throughput.
A developer named bendangnuksung created dynabatch, a custom PyTorch sampler, to solve a common bottleneck in encoder-decoder model training: fixed batch sizes limited by the longest sequences. While fine-tuning an NLLB-200 600M model on an RTX 5090, the maximum fixed batch size was just 8 before hitting OOM errors. Monitoring with nvidia-smi revealed that most batches underutilized the GPU, since only the longest source/target pairs actually stressed memory. Dynabatch addresses this by sorting examples by token length (longest first), treating the first batch as a memory baseline, then using an XGBoost regressor to predict memory pressure for larger candidate batch sizes on shorter sequences, selecting the largest size that stays under a safety threshold. This approach yielded a 3.3x throughput improvement in training benchmarks on the RTX 5090, though gains were smaller on a T4 GPU (1.06x-1.21x) for generation tasks. The tool includes a fallback for cases where the regressor overestimates what fits and causes an OOM, and the training notebooks are open-sourced for transparency. Designed primarily for encoder-decoder models like NLLB-200 for machine translation, where source length is a proxy for target length, dynabatch is less suited to decoder-only models, where sequence packing is preferred. The repo is on GitHub and PyPI.
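To make the mechanism concrete, here is a minimal, self-contained sketch of length-sorted dynamic batching in the spirit described above. It is not dynabatch's actual API: the class name `DynamicBatchSampler`, the `record`/`_predict_mem` helpers, and the use of scikit-learn's `LinearRegression` (as a stand-in for the XGBoost regressor) are all illustrative assumptions.

```python
# Hypothetical sketch of length-sorted dynamic batching, NOT dynabatch's API.
from typing import Iterator, List, Sequence, Tuple

import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for an XGBoost regressor


class DynamicBatchSampler:
    """Yield index batches whose size grows as sequences get shorter.

    Examples are sorted longest-first, so the first (longest) batch acts as a
    memory baseline. A regressor fit on observed (batch_size, max_len) -> memory
    points then predicts how large a batch of the next, shorter examples can be
    while staying under a safety fraction of that baseline.
    """

    def __init__(self, lengths: Sequence[int], base_batch_size: int = 8,
                 mem_budget: float = 0.9):
        self.lengths = list(lengths)
        self.base_batch_size = base_batch_size
        self.mem_budget = mem_budget                      # fraction of baseline memory
        self.model = LinearRegression()
        self._obs: List[Tuple[int, int, float]] = []      # (batch_size, max_len, mem_used)

    def record(self, batch_size: int, max_len: int, mem_used: float) -> None:
        """Feed back measured memory, e.g. torch.cuda.max_memory_allocated()."""
        self._obs.append((batch_size, max_len, mem_used))
        X = np.array([[b, l, b * l] for b, l, _ in self._obs])
        y = np.array([m for *_, m in self._obs])
        self.model.fit(X, y)

    def _predict_mem(self, batch_size: int, max_len: int) -> float:
        x = np.array([[batch_size, max_len, batch_size * max_len]])
        return float(self.model.predict(x)[0])

    def __iter__(self) -> Iterator[List[int]]:
        # Longest sequences first: the first batch sets the memory baseline,
        # every later batch only contains shorter sequences.
        order = sorted(range(len(self.lengths)), key=lambda i: -self.lengths[i])
        i = 0
        while i < len(order):
            max_len = self.lengths[order[i]]
            size = self.base_batch_size
            if self._obs:
                baseline_mem = self._obs[0][2]
                # Grow the candidate size while predicted memory stays under budget.
                while (i + size * 2 <= len(order)
                       and self._predict_mem(size * 2, max_len)
                       < self.mem_budget * baseline_mem):
                    size *= 2
            yield order[i:i + size]
            i += size
```

In a training loop, one would measure peak memory after each step and call `record(...)` so the regressor's predictions improve as more (batch size, length, memory) points accumulate.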
- Dynabatch dynamically adjusts batch size based on sequence length, boosting throughput 3.3x on RTX 5090 for NLLB-200 600M training.
- Uses an XGBoost regressor to predict GPU memory pressure, with a fallback for OOM overestimates (sketched after this list).
- Optimized for encoder-decoder MT models; less effective on decoder-only models or T4 GPUs (1.06x-1.21x gain).
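The OOM fallback can be illustrated with a short, hedged sketch: when the regressor overestimates what fits, the step catches the CUDA OOM, frees cached memory, and retries on smaller pieces. The helper names (`safe_train_step`, `train_step`, `split_batch`) are hypothetical, not dynabatch's interface.

```python
# Illustrative OOM fallback, not dynabatch's actual implementation.
import torch


def safe_train_step(model, batch, train_step, split_batch):
    """Run one step; on CUDA OOM, free cached memory and retry on two halves."""
    try:
        return train_step(model, batch)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        first, second = split_batch(batch)  # e.g. split tensors along the batch dim
        return train_step(model, first) + train_step(model, second)
```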
Why It Matters
Enables faster fine-tuning of encoder-decoder models on consumer GPUs, reducing OOM errors and improving resource utilization for MT tasks.