Research & Papers

MOCA: A Transformer-based Modular Causal Inference Framework with One-way Cross-attention and Cutting Feedback

This new method prevents outcome-to-treatment information leakage in causal effect estimation using modular transformers.

Deep Dive

Researchers Lei Wang and Debashis Ghosh introduce MOCA (Modular One-way Causal Attention), a transformer-based framework designed to improve causal effect estimation from observational data. Traditional estimators like inverse probability weighting (IPW) and augmented IPW (AIPW) struggle with complex, non-linear, and high-dimensional data. While machine learning approaches offer more flexibility, they often suffer from a critical flaw: joint training can allow outcome-related information to leak into treatment-side representations, violating causal assumptions.

MOCA addresses this through a modular design that separates treatment and outcome modeling. A one-way cross-attention mechanism ensures information flows only from treatment to outcome modules, not vice versa. The cutting-feedback strategy, implemented via gradient detachment, prevents the outcome loss from updating the treatment module during training. This preserves directional information flow while leveraging transformers' representational power.

Tests across linear, nonlinear, heavy-tailed, hidden confounding, and high-dimensional settings show MOCA matches or outperforms IPW, AIPW, X-learner, TARNet, and DragonNet. Real-world validation on the Infant Health and Development Program and Dehejia-Wahba datasets confirms its practical utility, making MOCA a promising, interpretable direction for causal inference with deep learning.
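To make the one-way flow concrete, here is a minimal NumPy sketch of the idea, not the authors' actual architecture: outcome-side queries attend over treatment-side keys and values, and no reverse attention call exists, so information can only move from the treatment module to the outcome module. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_way_cross_attention(outcome_h, treatment_h, Wq, Wk, Wv):
    """Outcome-side queries attend over treatment-side keys/values.
    Because there is no symmetric call in the other direction,
    information flows treatment -> outcome only, never outcome -> treatment.
    """
    Q = outcome_h @ Wq                      # queries from the outcome module
    K = treatment_h @ Wk                    # keys from the treatment module
    V = treatment_h @ Wv                    # values from the treatment module
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn = softmax(scores, axis=-1)         # each outcome token's weights sum to 1
    return attn @ V, attn

rng = np.random.default_rng(0)
d = 8
treatment_h = rng.standard_normal((5, d))   # 5 treatment-side tokens
outcome_h = rng.standard_normal((3, d))     # 3 outcome-side tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# The cutting-feedback step would gradient-detach treatment_h before it
# enters the outcome module (in PyTorch: treatment_h.detach()), so the
# outcome loss cannot update treatment parameters; the forward pass
# itself is numerically unchanged, as this sketch shows.
ctx, attn = one_way_cross_attention(outcome_h, treatment_h, Wq, Wk, Wv)
print(ctx.shape)   # one context vector per outcome token
```

In a full training loop, the detached treatment representation is what makes the "cutting" happen: the attention weights still condition the outcome model on treatment-side features, but backpropagation stops at the module boundary.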

Key Points
  • MOCA uses a transformer-based modular design with one-way cross-attention to separate treatment and outcome modeling.
  • A cutting-feedback strategy via gradient detachment prevents outcome information from contaminating treatment-side representations.
  • Outperforms IPW, AIPW, X-learner, TARNet, and DragonNet across linear, nonlinear, heavy-tailed, hidden confounding, and high-dimensional scenarios.

Why It Matters

MOCA offers a more robust, interpretable approach to causal inference from messy real-world data.