Multi-Domain Causal Empirical Bayes Under Linear Mixing
New algorithm uses Empirical Bayes to estimate latent causal variables from multi-domain data when the causal graph and intervention targets are known.
A team from Columbia University, including Bohan Wu, Julius von Kügelgen, and David M. Blei, has published a new paper on arXiv titled 'Multi-Domain Causal Empirical Bayes Under Linear Mixing.' The research tackles a core challenge in Causal Representation Learning (CRL): moving from theoretical identifiability to practical estimation. The team proposes using Empirical Bayes (EB), a statistical framework designed for simultaneous inference problems, to learn the underlying causal latent variables from high-dimensional, multi-domain observations.
The algorithm operates under a linear measurement model in which differences between data domains are modeled as interventions in a shared, acyclic Structural Causal Model (SCM). For the setting where the causal graph and intervention targets are known, the researchers develop an EM-style algorithm based on causally structured score matching. This 'f-modeling' approach improves the quality of the learned causal variables by leveraging invariant structure both within individual domains and across them. In experiments on synthetic data, the method estimated the latent causal variables more accurately than existing CRL techniques.
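To make the data setup concrete, here is a minimal simulation sketch of a linear SCM observed through a linear mixing matrix, with one domain per intervention target. It is not the paper's code. The function name `simulate_domain`, the Gaussian noise, the perfect (edge-cutting) intervention with a mean shift, and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_domain(rng, n, B, A, target=None, shift=2.0, noise_sd=0.1):
    """Draw n samples from one domain of a linear SCM seen through linear mixing.

    B[i, j] is the weight of edge j -> i; B is strictly lower triangular,
    so the graph is acyclic. target is the index of the intervened latent
    node, or None for the unintervened domain (hypothetical convention).
    """
    d = B.shape[0]
    B_dom = B.copy()
    mean = np.zeros(d)
    if target is not None:
        B_dom[target, :] = 0.0   # assumed perfect intervention: cut incoming edges
        mean[target] = shift     # and shift the exogenous mean of the target
    eps = rng.normal(size=(n, d)) + mean
    # Structural equations Z = B Z + eps, solved as Z = (I - B)^{-1} eps.
    Z = np.linalg.solve(np.eye(d) - B_dom, eps.T).T
    # Linear measurement model: X = A Z + small observation noise.
    X = Z @ A.T + noise_sd * rng.normal(size=(n, A.shape[0]))
    return Z, X

rng = np.random.default_rng(0)
# Shared causal graph: chain z0 -> z1 -> z2.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, -0.5, 0.0]])
A = rng.normal(size=(10, 3))  # shared mixing matrix, 3 latents -> 10 observed dims
domains = {t: simulate_domain(rng, 500, B, A, target=t) for t in (None, 0, 1, 2)}
```

Every domain shares the same `B` and `A`; only the mechanism of the intervened node changes. That shared part is the kind of invariant structure an estimator can exploit across domains.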
The work also situates this EB 'f-modeling' approach relative to existing 'g-modeling' CRL methods. By formally connecting causal representation learning with the Empirical Bayes framework, the research provides a new, principled estimation tool. This could matter for fields that rely on extracting robust, interpretable causal factors from complex, multi-source data, such as genomics, computational social science, and other applications where data is collected under varying conditions or interventions.
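In Empirical Bayes terminology, g-modeling fits the prior over the latent variables, while f-modeling fits the marginal distribution of the observations directly. A classic scalar example of f-modeling is Tweedie's formula: for x = z + Gaussian noise with variance sigma^2, E[z | x] = x + sigma^2 * d/dx log f(x), where f is the marginal density of x. The sketch below is a toy illustration of that idea, not the paper's algorithm. The Gaussian fit to the marginal and the name `tweedie_denoise` are assumptions made for this example.

```python
import numpy as np

def tweedie_denoise(x, sigma):
    """Estimate E[z | x] for x = z + N(0, sigma^2) using only the marginal of x."""
    # f-modeling step: fit the marginal density of x directly.
    # A Gaussian fit is assumed here purely to keep the example short.
    mu_hat, var_hat = x.mean(), x.var()
    score = -(x - mu_hat) / var_hat   # d/dx log f(x) under the fitted Gaussian
    return x + sigma**2 * score       # Tweedie's formula

rng = np.random.default_rng(0)
sigma = 1.0
z = rng.normal(0.0, 1.0, size=5000)              # latent values (unknown in practice)
x = z + rng.normal(0.0, sigma, size=5000)        # noisy observations
z_hat = tweedie_denoise(x, sigma)

mse_raw = np.mean((x - z) ** 2)      # error of using x itself as the estimate
mse_eb = np.mean((z_hat - z) ** 2)   # error of the Empirical Bayes estimate
```

The estimator never specifies the prior on `z`; it only needs the score of the marginal. This is why score matching, which estimates that score from data, is a natural fit for f-modeling.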
- Uses Empirical Bayes for simultaneous inference on multi-domain data, where domains represent interventions in a shared causal model.
- Proposes an EM-style 'f-modeling' algorithm based on causally structured score matching for known graph/intervention targets.
- Outperforms other CRL methods in synthetic experiments, estimating latent causal variables more accurately.
Why It Matters
The method offers a principled way to estimate interpretable causal factors from data collected across different environments or conditions, although the reported results so far come from synthetic experiments.