CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data
New technique solves AI's 'majority bias' problem, improving performance on rare categories in datasets.
A research team from October University for Modern Sciences and Arts and Qatar University has introduced CAMO (Class-Aware Minority-Optimized), a novel ensemble technique designed to address a critical flaw in AI evaluation: performance degradation on imbalanced data. Traditional ensemble methods tend to favor majority classes in a dataset, severely hampering accuracy on minority categories—a common issue in real-world applications such as medical diagnosis and fraud detection. CAMO employs a hierarchical procedure that combines vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost predictions for underrepresented classes without sacrificing overall performance.
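The paper does not publish CAMO's exact formulas here, but the three ingredients named above—confidence-weighted vote distributions, an inverse-frequency boost for rare classes, and an inter-model disagreement penalty—can be sketched in a few lines. Everything below (`camo_style_vote`, the `alpha` exponent, the specific combination of terms) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def camo_style_vote(model_probs, class_freq, alpha=1.0):
    """Illustrative class-aware ensemble vote (NOT the authors' exact CAMO).

    model_probs: (n_models, n_classes) calibrated class probabilities,
                 one row per ensemble member.
    class_freq:  per-class frequency in the training data; rarer classes
                 receive a larger boost.
    alpha:       hypothetical knob controlling boost strength.
    Returns the index of the winning class.
    """
    P = np.asarray(model_probs, dtype=float)
    freq = np.asarray(class_freq, dtype=float)

    # 1) Confidence-weighted soft vote: more confident models count more.
    conf = P.max(axis=1, keepdims=True)
    soft = (conf * P).sum(axis=0) / conf.sum()

    # 2) Inverse-frequency boost: a class half as frequent as average
    #    gets roughly twice the weight (alpha=1).
    boost = (freq.sum() / (len(freq) * freq)) ** alpha

    # 3) Inter-model uncertainty: damp classes the members disagree on.
    damp = 1.0 - P.std(axis=0)

    return int(np.argmax(soft * boost * damp))


# Three models lean toward majority class 0, but class 2 is rare:
probs = [[0.5, 0.1, 0.4],
         [0.6, 0.1, 0.3],
         [0.4, 0.2, 0.4]]
print(camo_style_vote(probs, class_freq=[0.7, 0.2, 0.1]))   # minority wins
print(camo_style_vote(probs, class_freq=[1/3, 1/3, 1/3]))   # plain vote
```

With uniform class frequencies the boost is a no-op and the sketch reduces to an ordinary confidence-weighted soft vote, which is one way to see the "without sacrificing overall performance" claim.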
The researchers rigorously tested CAMO on two highly imbalanced, domain-specific benchmarks: the DAIR-AI/Emotion dataset and the ternary BEA 2025 dataset. They benchmarked it against seven established ensemble algorithms using eight different language models—three Large Language Models (LLMs) and five Smaller Language Models (SLMs)—under both zero-shot and fine-tuned settings. The results were clear: with fine-tuned models, CAMO consistently achieved the highest strict macro F1-score across both benchmarks. The study also showed that CAMO's benefits work in concert with model adaptation, indicating that the optimal ensemble choice depends on specific model properties. This positions CAMO as a reliable, domain-neutral framework for robust categorization where data is uneven, moving beyond synthetic balance to handle authentic, skewed distributions.
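Macro F1 is the natural metric here because it averages per-class F1 scores with equal weight, so ignoring a rare class is punished even when overall accuracy looks high. A minimal self-contained illustration (the label arrays are hypothetical, not taken from the paper's datasets):

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n_classes


# 8 majority-class examples, 2 minority-class examples:
y_true = [0] * 8 + [1] * 2

majority_only = [0] * 10            # 80% accuracy, but...
print(round(macro_f1(y_true, majority_only, 2), 3))   # 0.444

catches_minority = [0] * 8 + [1, 0]  # recovers one rare example
print(round(macro_f1(y_true, catches_minority, 2), 3))
```

Predicting only the majority class scores 0.8 accuracy yet only 0.444 macro F1, because the minority class contributes an F1 of zero; this is exactly the failure mode CAMO's minority boost targets.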
- CAMO uses a hierarchical process with vote distribution and confidence calibration to boost minority class performance.
- Tested with 8 models (3 LLMs, 5 SLMs), it outperformed 7 established ensemble methods, achieving the top macro F1-score on imbalanced benchmarks.
- Proven effective on real-world datasets like DAIR-AI/Emotion, making AI evaluation more reliable for skewed data.
Why It Matters
Enables more accurate AI in critical real-world applications like healthcare and finance where data is naturally imbalanced.