Research & Papers

REALITrees: Rashomon Ensemble Active Learning for Interpretable Trees

New method replaces random ensembles with exact enumeration of all near-optimal sparse decision trees.

Deep Dive

A research team including Simon D. Nguyen, Hayden McTavish, and renowned interpretable-ML expert Cynthia Rudin has introduced REALITrees, a new approach to active learning. The framework, detailed in a March 2026 arXiv paper, rethinks how Query-by-Committee (QBC) builds its committee: instead of inducing model disagreement through random feature subsetting or data blinding, REALITrees constructs the committee by exhaustively enumerating the "Rashomon Set", the complete collection of all near-optimal models for a given problem. This yields a direct characterization of the plausible hypothesis space rather than a mere approximation of epistemic uncertainty.
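The committee-based query step described above can be sketched as follows. This is a minimal illustration, not the paper's method: the committee is a stand-in array of model predictions (the exact Rashomon-set enumeration of sparse trees is not reproduced here), and `vote_entropy` and `select_query` are hypothetical helper names. The point to label next is the one the committee disagrees on most.

```python
import numpy as np

def vote_entropy(committee_preds: np.ndarray, n_classes: int) -> np.ndarray:
    """Disagreement per unlabeled point: entropy of the committee's hard votes.

    committee_preds: shape (n_models, n_points), integer class labels.
    """
    n_points = committee_preds.shape[1]
    scores = np.zeros(n_points)
    for c in range(n_classes):
        frac = (committee_preds == c).mean(axis=0)  # vote share for class c
        nz = frac > 0                               # skip log(0) terms
        scores[nz] -= frac[nz] * np.log(frac[nz])
    return scores

def select_query(committee_preds: np.ndarray, n_classes: int) -> int:
    """Index of the candidate point with maximal committee disagreement."""
    return int(np.argmax(vote_entropy(committee_preds, n_classes)))

# Toy committee: 3 members, 4 candidate points, binary labels.
# Point 1 splits the committee 1-vs-2; all other points are unanimous.
preds = np.array([[0, 0, 1, 0],
                  [0, 1, 1, 0],
                  [0, 1, 1, 0]])
query_idx = select_query(preds, n_classes=2)  # picks point 1
```

Unanimous points score zero entropy, so an exactly enumerated committee never wastes a label where every near-optimal model already agrees.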

To manage the functional redundancy within this potentially large set, the team employs a PAC-Bayesian framework, weighting each committee member with a Gibbs posterior over its empirical risk. Crucially, recent algorithmic advances make this exact enumeration tractable for the class of sparse decision trees, a key model family in interpretable machine learning. In benchmarks against established active learning baselines, REALITrees consistently outperformed randomized ensembles. The gains were most pronounced in moderately noisy environments, where the method exploits the expanded model multiplicity of the Rashomon Set to converge significantly faster, cutting the cost and duration of data labeling campaigns.
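A Gibbs posterior of the kind mentioned above is typically proportional to exp(-λ·risk), so lower-risk members dominate the vote. The sketch below assumes that standard form; `gibbs_weights` and the temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def gibbs_weights(empirical_risks: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Gibbs-posterior committee weights: w_i proportional to exp(-lam * R_i).

    lam (the inverse temperature) controls how sharply the posterior
    concentrates on the lowest-risk models in the Rashomon Set.
    """
    logits = -lam * empirical_risks
    logits -= logits.max()          # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()              # normalize to a probability vector

# Two near-optimal trees and one slightly worse member of the set.
risks = np.array([0.10, 0.11, 0.25])
w = gibbs_weights(risks, lam=20.0)  # most weight on the 0.10-risk model
```

With these weights, the vote shares in any disagreement measure become weighted sums, so a cluster of functionally redundant near-duplicates no longer drowns out distinct low-risk hypotheses.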

Key Points
  • Replaces random perturbation in Query-by-Committee with exact enumeration of the Rashomon Set of all near-optimal models.
  • Uses a PAC-Bayesian Gibbs posterior to weight committee members, addressing functional redundancy within the enumerated set.
  • Outperforms randomized ensembles in benchmarks, achieving faster convergence, especially in moderately noisy data environments.

Why It Matters

This makes active learning for interpretable models like decision trees more efficient and theoretically grounded, reducing data labeling costs.