Research & Papers

SCOPE-FE: Structured Control of Operator and Pairwise Exploration for Feature Engineering

New method cuts combinatorial explosion before feature generation begins

Deep Dive

Automatic feature engineering (FE) is a proven way to boost tabular model accuracy, but expand-and-reduce approaches like OpenFE suffer from combinatorial explosion as input dimensionality grows. A new paper from Minhee Park, Seongyeon Son, Yonghyun Lee, and Eunchan Kim introduces SCOPE-FE, a structured search space control framework that tackles this head-on. Instead of generating all possible feature combinations and then pruning, SCOPE-FE reduces the candidate space prior to generation. It jointly controls two growth sources: operator space (via OperatorProbing, which estimates dataset-specific operator utility and discards low-value ones) and feature-pair space (via FeatureClustering, which uses spectral embedding and fuzzy c-means to group related features and restricts generation to within-cluster pairs). A third component, ReliabilityScoring, incorporates variance across subsamples to make pruning decisions more robust.
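To make the search-space arithmetic concrete, here is a minimal Python sketch. It is not the authors' code, which has not been released. The function names, operator utility values, and cluster assignments are all hypothetical, and the clusters are hard-coded rather than derived from spectral embedding and fuzzy c-means. The sketch only illustrates how dropping operators and restricting pairs to within-cluster combinations multiply together to shrink the candidate count.

```python
from itertools import combinations


def count_all_candidates(n_features, n_operators):
    """Candidates when every unordered feature pair meets every binary operator."""
    return n_operators * (n_features * (n_features - 1) // 2)


def keep_operators(utility, threshold):
    """Keep operators whose estimated utility reaches the threshold (assumed rule)."""
    return [op for op, score in utility.items() if score >= threshold]


def within_cluster_pairs(clusters):
    """Collect unordered feature pairs whose members share at least one cluster.

    A feature may appear in several clusters, loosely mimicking soft (fuzzy)
    membership; the set removes any pair counted twice.
    """
    pairs = set()
    for members in clusters.values():
        pairs.update(combinations(sorted(set(members)), 2))
    return pairs


# Hypothetical inputs: 12 features, 6 binary operators with made-up utilities.
features = [f"f{i}" for i in range(12)]
utility = {"add": 0.021, "sub": 0.002, "mul": 0.034,
           "div": 0.018, "min": 0.001, "max": 0.003}
clusters = {
    "A": features[0:4],   # f0..f3
    "B": features[3:8],   # f3..f7 (f3 belongs to both A and B)
    "C": features[8:12],  # f8..f11
}

baseline = count_all_candidates(len(features), len(utility))  # 6 * 66 = 396
kept_ops = keep_operators(utility, threshold=0.01)            # add, mul, div
pairs = within_cluster_pairs(clusters)                        # 6 + 10 + 6 = 22
controlled = len(kept_ops) * len(pairs)                       # 3 * 22 = 66

print(baseline, len(kept_ops), len(pairs), controlled)
```

In this toy setting the candidate count falls from 396 to 66 before any feature is generated. Because the number of unrestricted pairs grows quadratically with the number of features, the gap widens as dimensionality increases, which is consistent with the reported gains on high-dimensional data.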

Experiments across ten benchmark datasets show that SCOPE-FE delivers substantial efficiency gains, especially on high-dimensional data, while maintaining competitive predictive performance relative to existing baselines. The authors argue that structured control of the search space is a scalable alternative to brute-force expansion. This is a practical advance for data scientists and ML engineers who need to run automated feature engineering on real-world datasets without blowing up compute budgets. Code will be released upon acceptance. The paper is available on arXiv (2604.27025).

Key Points
  • OperatorProbing eliminates low-utility operators before feature generation, shrinking the search space
  • FeatureClustering uses spectral embedding and fuzzy c-means to restrict feature pairs to within-cluster combinations
  • ReliabilityScoring stabilizes pruning by accounting for variance across subsamples; evaluated on ten benchmark datasets
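
As a rough illustration of variance-aware pruning, the sketch below ranks candidate features by mean gain minus a penalty on the spread across subsamples. The summary above does not give the paper's actual scoring formula, so this "mean minus standard deviation" rule, the function names, and the gain values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def reliability_score(gains, penalty=1.0):
    """Mean gain across subsamples, penalized by its standard deviation.

    Assumed scoring rule, not the formula from the paper.
    """
    return mean(gains) - penalty * pstdev(gains)


def prune(candidates, keep, penalty=1.0):
    """Return the names of the `keep` highest-scoring candidates."""
    ranked = sorted(
        candidates,
        key=lambda name: reliability_score(candidates[name], penalty),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical validation gains of three candidate features on three subsamples.
candidates = {
    "a*b": [0.030, 0.031, 0.029],      # modest but stable
    "a/b": [0.080, -0.020, 0.045],     # highest mean, very unstable
    "log(c)": [0.010, 0.012, 0.011],   # small but stable
}

by_mean = prune(candidates, keep=2, penalty=0.0)         # ['a/b', 'a*b']
by_reliability = prune(candidates, keep=2, penalty=1.0)  # ['a*b', 'log(c)']

print(by_mean, by_reliability)
```

With no penalty, the unstable candidate ranks first on its mean alone. Once variance is penalized, it drops out in favor of the two stable candidates, which is the kind of robustness the third component is described as providing.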

Why It Matters

Reduces the compute cost of automated feature engineering on high-dimensional tabular data while keeping predictive performance competitive with existing baselines.