Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration

A new probabilistic algorithm isolates patterns shared across brain scans and cognitive tests, and the resulting scores track established Alzheimer's biomarkers.

Deep Dive

A team of researchers including Raphiel Murden, Ganzhong Tian, Deqiang Qiu, and Benjamin Risk has published a paper on arXiv introducing ProJIVE (Probabilistic Joint and Individual Variation Explained), a probabilistic model for a core challenge in modern data science: integrating multiple types of data collected from the same subjects, as is common in genomics, metabolomics, and neuroimaging. ProJIVE extends the established Joint and Individual Variation Explained (JIVE) framework by giving it a formal probabilistic foundation and fitting it by maximum likelihood with an expectation-maximization (EM) algorithm. JIVE decomposes each dataset into low-rank approximations of variation that is shared (joint) across all data types and variation that is unique (individual) to each. Because ProJIVE estimates the joint and individual components simultaneously, the authors report that it can recover them more accurately than earlier methods.
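To make the joint/individual decomposition and the EM fit more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a factor-analysis-style model in which each data block is a joint-score term plus a block-specific score term plus isotropic Gaussian noise, with standard normal scores. The function names `simulate_blocks` and `fit_em`, the parameterization, and all default values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_blocks(n=200, p=(30, 20), r_joint=1, r_indiv=(1, 1), noise_sd=0.5, seed=0):
    """Simulate data blocks sharing joint scores (illustrative model, an assumption)."""
    rng = np.random.default_rng(seed)
    z_joint = rng.standard_normal((n, r_joint))           # scores shared by all blocks
    blocks = []
    for p_k, r_k in zip(p, r_indiv):
        z_indiv = rng.standard_normal((n, r_k))           # scores unique to this block
        w_joint = rng.standard_normal((p_k, r_joint))
        w_indiv = rng.standard_normal((p_k, r_k))
        x_k = z_joint @ w_joint.T + z_indiv @ w_indiv.T
        x_k += noise_sd * rng.standard_normal((n, p_k))
        blocks.append(x_k - x_k.mean(axis=0))             # column-center each block
    return blocks, z_joint

def fit_em(blocks, r_joint, r_indiv, n_iter=50, seed=0):
    """EM for a Gaussian latent-variable model with structured loadings (sketch)."""
    rng = np.random.default_rng(seed)
    n = blocks[0].shape[0]
    p = [b.shape[1] for b in blocks]
    r = r_joint + sum(r_indiv)
    # Latent columns used by block k: the joint columns plus its own individual columns.
    cols, start = [], r_joint
    for r_k in r_indiv:
        cols.append(list(range(r_joint)) + list(range(start, start + r_k)))
        start += r_k
    rows = np.split(np.arange(sum(p)), np.cumsum(p)[:-1])
    X = np.hstack(blocks)
    S = X.T @ X / n                                       # sample covariance (centered data)
    W = np.zeros((sum(p), r))                             # entries outside the structure stay 0
    for rk, ck in zip(rows, cols):
        W[np.ix_(rk, ck)] = 0.1 * rng.standard_normal((len(rk), len(ck)))
    sigma2 = np.array([b.var() for b in blocks])          # one noise variance per block
    loglik = []
    for _ in range(n_iter):
        d = np.concatenate([np.full(pk, s) for pk, s in zip(p, sigma2)])
        # Observed-data log-likelihood at the current parameters.
        C = W @ W.T + np.diag(d)
        _, logdet = np.linalg.slogdet(C)
        loglik.append(-0.5 * n * (sum(p) * np.log(2 * np.pi) + logdet
                                  + np.trace(np.linalg.solve(C, S))))
        # E-step: posterior mean and summed second moment of the latent scores.
        WtDinv = W.T / d                                  # W^T D^{-1}
        Sigma_z = np.linalg.inv(np.eye(r) + WtDinv @ W)   # posterior covariance
        Ez = X @ WtDinv.T @ Sigma_z                       # n x r posterior means
        Szz = n * Sigma_z + Ez.T @ Ez
        # M-step: update each block's loadings and noise variance in closed form.
        for k, (rk, ck) in enumerate(zip(rows, cols)):
            A = blocks[k].T @ Ez[:, ck]
            B = Szz[np.ix_(ck, ck)]
            Wk = np.linalg.solve(B, A.T).T                # A @ inv(B), B is symmetric
            W[np.ix_(rk, ck)] = Wk
            resid = (blocks[k] ** 2).sum() - 2 * np.trace(Wk.T @ A) + np.trace(Wk.T @ Wk @ B)
            sigma2[k] = resid / (n * len(rk))
    return W, sigma2, Ez[:, :r_joint], np.array(loglik)

if __name__ == "__main__":
    blocks, _ = simulate_blocks()
    W, sigma2, joint_scores, loglik = fit_em(blocks, r_joint=1, r_indiv=(1, 1))
    print("log-likelihood, first and last iteration:", loglik[0], loglik[-1])
```

The zero pattern in `W` is what separates joint from individual variation in this sketch: joint columns load on every block, while each individual column loads on one block only. Both kinds of component are updated within the same EM iteration, which is the sense of "simultaneous" estimation illustrated here.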

The researchers demonstrated ProJIVE on Alzheimer's disease, applying the model to measures of brain structure (morphometry) and cognitive test scores from the Alzheimer's Disease Neuroimaging Initiative (ADNI). ProJIVE learned biologically meaningful patterns of variation, and the joint subject scores it derived from the combined brain and cognition data were strongly related to existing, more expensive biomarkers for the disease. This suggests that ProJIVE can extract integrative signals from cheaper measurements that track costlier biomarkers, which could speed discovery in complex biomedical fields. The code for the analysis is publicly available on GitHub.
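As a rough illustration of how a relationship between joint subject scores and a biomarker might be checked, the sketch below computes a Pearson correlation. The article does not say which statistic the authors used, so the choice of Pearson correlation is an assumption. Both variables are synthetic stand-ins generated in the code, not ADNI data, and the helper `pearson_r` is a hypothetical name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

rng = np.random.default_rng(0)
# Synthetic stand-ins (assumptions for illustration only, not real measurements).
joint_score = rng.standard_normal(500)
biomarker = 0.8 * joint_score + 0.6 * rng.standard_normal(500)

r = pearson_r(joint_score, biomarker)
print(f"Pearson r between joint score and synthetic biomarker: {r:.2f}")
```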

Key Points
  • ProJIVE uses a probabilistic EM algorithm to estimate shared (joint) and unique (individual) patterns across multiple datasets simultaneously, which can improve accuracy over previous methods.
  • Applied to ADNI data on Alzheimer's disease, it found joint brain-cognition patterns strongly related to established biomarkers, supporting its biological relevance.
  • Provides a unified, open-source framework for multi-modal data integration in fields like genomics and neuroimaging, where analyzing linked datasets is common.

Why It Matters

Could enable cheaper, more accurate discovery of hidden relationships in complex biomedical data, supporting research on diseases like Alzheimer's.