Research & Papers

Position bias in dense retrievers is learned from data, study finds

New study shows 57-87% reduction in position bias with balanced training.

Deep Dive

A new study from researchers Daegon Yu, SeungYoon Han, and Woomyoung Park investigates whether position bias in dense retrievers is inherent to their architecture or learned from training data. Dense retrievers often favor documents where query-relevant information appears near the beginning, hurting performance when evidence is later in the text. The team constructed synthetic position-targeted training sets where evidence consistently appeared at the beginning, middle, or end, then fine-tuned eight architecturally diverse pretrained models (including variations of BERT) under both skewed and balanced distributions. The results were clear: skewed training led all models to favor documents with evidence at the corresponding position, showing that training data distribution is a major driver of retrieval-level bias.
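To make the setup concrete, here is a minimal sketch of how a position-targeted document might be assembled. It is an illustration, not the authors' code: the function name `place_evidence`, the sentence-level granularity, and the filler text are all assumptions.

```python
from typing import List

# The three evidence positions described in the study summary.
POSITIONS = ("beginning", "middle", "end")


def place_evidence(filler: List[str], evidence: str, position: str) -> str:
    """Return a document with `evidence` inserted at the requested position.

    Hypothetical helper: `filler` stands in for query-irrelevant sentences.
    """
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    # Map the position label to a sentence index within the filler text.
    index = {
        "beginning": 0,
        "middle": len(filler) // 2,
        "end": len(filler),
    }[position]
    sentences = filler[:index] + [evidence] + filler[index:]
    return " ".join(sentences)


# Illustrative inputs only; not taken from the study's data.
filler = ["Filler one.", "Filler two.", "Filler three.", "Filler four."]
evidence = "The answer is here."
docs = {p: place_evidence(filler, evidence, p) for p in POSITIONS}
```

Building the same evidence into three variants of one document keeps everything but position fixed, so any difference in how a retriever scores the variants can be attributed to where the evidence sits.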

When training data was balanced across positions, positional sensitivity dropped by 57-87% on position-aware benchmarks, while mean retrieval performance remained competitive in controlled settings. Representation-level analyses further revealed that fine-tuning reshapes learned positional preferences, though some pre-existing architectural or pretraining tendencies persisted in certain models. Overall, the study identifies training-position distribution as a key controllable factor in position bias and recommends balanced data curation as an effective mitigation strategy. For practitioners building search systems or RAG pipelines, this suggests that careful data design can reduce position bias while keeping retrieval performance competitive, at least in the controlled settings the study examined.
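The difference between a skewed and a balanced training distribution can be sketched in a few lines. This is an assumed, simplified scheme (round-robin assignment of evidence positions); the study's actual sampling procedure is not described here, and `assign_positions` is a hypothetical name.

```python
from collections import Counter
from itertools import cycle, islice
from typing import List

POSITIONS = ("beginning", "middle", "end")


def assign_positions(n_examples: int, mode: str) -> List[str]:
    """Choose an evidence position for each training example.

    mode="balanced" cycles through all positions evenly (an assumption);
    mode set to a single position name yields a fully skewed set.
    """
    if mode == "balanced":
        return list(islice(cycle(POSITIONS), n_examples))
    if mode in POSITIONS:
        return [mode] * n_examples
    raise ValueError(f"unknown mode: {mode!r}")


# Count how often each position appears under each regime.
balanced = Counter(assign_positions(9, "balanced"))
skewed = Counter(assign_positions(9, "end"))
```

With nine examples, the balanced regime places evidence at each position three times, while the skewed regime places it at the end every time.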

Key Points
  • Skewed training distributions caused all 8 architecturally diverse models to favor evidence at the corresponding document position.
  • Balanced training reduced positional sensitivity by 57-87% on position-aware benchmarks while maintaining competitive retrieval performance.
  • Representation-level analysis showed fine-tuning reshapes learned preferences, but some pre-existing tendencies from architecture or pretraining persisted.

Why It Matters

Balancing where evidence appears in training data can substantially reduce position bias in dense retrievers, which matters for search and RAG systems that need to surface documents whose relevant evidence appears late in the text.