Dictionary Based Pattern Entropy for Causal Direction Discovery
A new method combines algorithmic and Shannon information theory to find what causes what in symbolic data.
A team of researchers has introduced a framework called Dictionary Based Pattern Entropy (DPE) to tackle the problem of discovering causal direction from temporal observational data, particularly for symbolic sequences where traditional functional causal models fail. The work, published on arXiv, interprets causation as the emergence of compact, rule-based patterns in a candidate cause that systematically constrain an effect variable. DPE integrates principles from Algorithmic Information Theory (AIT) and Shannon information theory to construct direction-specific dictionaries and to quantify their influence with entropy-based measures, creating a principled link between deterministic pattern structure and stochastic variability.
At the core of the method is a minimum-uncertainty criterion: the causal direction is inferred as the one exhibiting stronger and more consistent pattern-driven organization. In the reported results, DPE achieved reliable performance across diverse synthetic systems, including delayed bit-flip perturbations, AR(1) coupling, and 1D skew-tent maps, matching or outperforming existing AIT-based methods such as ETC_E, ETC_P, and LZ_P. In real-world tests on biological and ecological datasets its performance was competitive, though alternative methods held advantages in specific genomic settings. The findings suggest that minimizing pattern-level uncertainty provides a robust, interpretable, and broadly applicable framework for causal discovery, moving beyond correlation to identify the underlying drivers of change in complex systems.
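The minimum-uncertainty idea can be illustrated with a toy sketch. The paper's actual dictionary construction and entropy measures are not reproduced here; everything below (using fixed-length substrings of the candidate cause as a crude "dictionary", the pattern length `k`, and the lag) is an illustrative assumption. The sketch simply prefers the direction whose patterns leave the least Shannon entropy in the putative effect:

```python
import math
import random
from collections import Counter, defaultdict

def pattern_entropy(cause, effect, k=2, lag=1):
    """Illustrative stand-in for a direction-specific dictionary:
    length-k substrings of the candidate cause serve as patterns, and we
    measure the Shannon entropy of the effect symbol that follows each
    pattern. Lower entropy means the patterns constrain the effect more
    tightly. (Assumed scheme, not the paper's exact construction.)"""
    dictionary = defaultdict(Counter)  # pattern -> counts of effect symbols
    for i in range(len(cause) - k - lag + 1):
        pattern = tuple(cause[i:i + k])
        dictionary[pattern][effect[i + k + lag - 1]] += 1
    total = sum(sum(c.values()) for c in dictionary.values())
    h = 0.0
    for counts in dictionary.values():
        n = sum(counts.values())
        for c in counts.values():
            p = c / n
            h -= (n / total) * p * math.log2(p)  # pattern-weighted conditional entropy
    return h

def infer_direction(x, y, k=2, lag=1):
    """Minimum-uncertainty rule: prefer the direction whose patterns
    leave the least residual uncertainty in the putative effect."""
    h_xy = pattern_entropy(x, y, k, lag)
    h_yx = pattern_entropy(y, x, k, lag)
    return ("X->Y" if h_xy < h_yx else "Y->X"), h_xy, h_yx

# Toy delayed-copy system: Y mirrors X one step later, so X-patterns
# fully determine Y, while Y-patterns cannot predict fresh random bits of X.
random.seed(0)
x = [random.randint(0, 1) for _ in range(500)]
y = [0] + x[:-1]
direction, h_xy, h_yx = infer_direction(x, y)
print(direction, round(h_xy, 3), round(h_yx, 3))
```

On this deterministic toy system the forward entropy collapses to zero while the reverse entropy stays near one bit, so the criterion recovers the X to Y direction.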
- Proposes Dictionary Based Pattern Entropy (DPE), a new framework combining Algorithmic and Shannon Information Theory for causal discovery.
- Infers direction by selecting the causal path that minimizes pattern-level uncertainty, outperforming methods like ETC_E and LZ_P on synthetic systems.
- Provides an interpretable method for symbolic sequences (e.g., biological, ecological data) where traditional noise/functional models are unavailable.
Why It Matters
Provides a robust, interpretable method to move beyond correlation and identify true causal drivers in complex symbolic data like genomics.