Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
New AI system generates training labels automatically, achieving near-perfect coverage and cutting annotation costs dramatically.
A research team led by Phong Lam, Ha-Linh Nguyen, and colleagues has introduced EXPONA, a novel framework that automates the creation of label functions (LFs) for training data. Manual data annotation is a major bottleneck in AI development: it is costly, slow, and error-prone. Programmatic labeling instead uses LFs—heuristic rules like "if the text contains 'refund,' label it as a customer complaint"—to generate weak labels at scale. Current methods often rely on large language models (LLMs) for surface-level rules or on limited hand-crafted primitives, resulting in poor coverage and unreliable label quality.
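To make the idea concrete, here is a minimal sketch of what a label function looks like in practice. The labels, keywords, and function name are illustrative assumptions, not components of EXPONA itself:

```python
# A label function (LF) votes on each example or abstains.
# COMPLAINT/NOT_COMPLAINT/ABSTAIN values are hypothetical conventions.
COMPLAINT, NOT_COMPLAINT, ABSTAIN = 1, 0, -1

def lf_refund_keyword(text: str) -> int:
    """Surface-level LF: texts mentioning 'refund' get a weak 'complaint' label."""
    return COMPLAINT if "refund" in text.lower() else ABSTAIN

texts = [
    "I want a refund for my broken order.",
    "Great service, thank you!",
]
weak_labels = [lf_refund_keyword(t) for t in texts]
print(weak_labels)  # [1, -1]
```

Many such LFs are applied together; where they abstain or disagree, a label model aggregates their votes into training labels.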
EXPONA addresses these limitations by formulating LF generation as a structured exploration and exploitation problem. It systematically explores LFs from three perspectives: surface (keywords, patterns), structural (syntax, dependencies), and semantic (meaning, context). The framework then applies reliability-aware mechanisms to filter out noisy or redundant heuristics while preserving complementary signals. This balanced approach ensures both high coverage and precision in the generated label set.
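The exploration-plus-filtering idea above can be sketched as follows. This is a toy illustration under stated assumptions, not EXPONA's actual algorithm: the three LFs, the complaint vocabulary, the dev set, and the precision/coverage thresholds are all hypothetical.

```python
# Sketch: LFs from three perspectives, then reliability-aware filtering
# that keeps only LFs meeting precision and coverage thresholds on a
# small labeled dev set. All names and thresholds are illustrative.
import re

ABSTAIN = -1

def lf_surface(text):       # surface: keyword match
    return 1 if "refund" in text.lower() else ABSTAIN

def lf_structural(text):    # structural: imperative-looking opening
    return 1 if re.match(r"(?i)^(give|send|return)\b", text) else ABSTAIN

def lf_semantic(text):      # semantic (toy): overlap with a complaint vocabulary
    vocab = {"broken", "late", "damaged", "refund", "angry"}
    return 1 if len(vocab & set(text.lower().split())) >= 2 else ABSTAIN

def lf_noisy(text):         # fires on everything -- should be filtered out
    return 1

def filter_reliable(lfs, dev_texts, dev_labels,
                    min_precision=0.8, min_coverage=0.1):
    """Keep LFs whose dev-set precision and coverage pass the thresholds."""
    kept = []
    for lf in lfs:
        votes = [lf(t) for t in dev_texts]
        fired = [(v, y) for v, y in zip(votes, dev_labels) if v != ABSTAIN]
        coverage = len(fired) / len(dev_texts)
        precision = sum(v == y for v, y in fired) / len(fired) if fired else 0.0
        if precision >= min_precision and coverage >= min_coverage:
            kept.append(lf)
    return kept

dev_texts = [
    "I want a refund for my broken order",
    "give me my money back now",
    "great product, very happy",
    "the item arrived late and damaged",
]
dev_labels = [1, 1, 0, 1]

reliable = filter_reliable(
    [lf_surface, lf_structural, lf_semantic, lf_noisy],
    dev_texts, dev_labels,
)
print(len(reliable))  # 3 -- the indiscriminate lf_noisy is dropped
```

The three precise-but-narrow LFs survive because they cover different examples (complementary signals), while the high-coverage but imprecise one is rejected.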
In extensive experiments across eleven diverse classification datasets, EXPONA consistently outperformed state-of-the-art automated LF generation methods. The framework achieved near-complete label coverage (up to 98.9%), improved weak label quality by up to 87%, and delivered downstream performance gains of up to 46% in weighted F1 scores for trained models. These results demonstrate that EXPONA's multi-level exploration and intelligent filtering produce more consistent, high-quality training data across various tasks.
The research, accepted by the Knowledge-Based Systems (KBS) Journal and available on arXiv, represents a significant advance in data-centric AI. By automating and improving programmatic labeling, EXPONA reduces dependency on expensive manual annotation and large language models for heuristic generation. This enables faster iteration on AI projects and more reliable model training, particularly in domains where labeled data is scarce or expensive to obtain.
- Achieves up to 98.9% label coverage across 11 diverse datasets, nearing complete automation of training data annotation
- Improves weak label quality by up to 87% and boosts downstream model performance by up to 46% in weighted F1 scores
- Uses multi-level exploration (surface, structural, semantic) and reliability-aware filtering to balance coverage and precision
Why It Matters
Dramatically reduces the cost and time of creating labeled training data, accelerating AI development across industries where annotation is a bottleneck.