Research & Papers

Neighbor Embedding for High-Dimensional Sparse Poisson Data

New 'p-SNE' algorithm uses Poisson statistics to reveal hidden patterns in word counts, neural spikes, and email activity.

Deep Dive

Researchers Noga Mudrik and Adam S. Charles have introduced p-SNE (Poisson Stochastic Neighbor Embedding), a novel algorithm designed to solve a specific but widespread problem in data science: visualizing and analyzing high-dimensional data where the measurements are counts of events. This includes datasets like word counts in documents, neural spike counts, or daily email frequencies, which are sparse and naturally modeled as Poisson-distributed counts. Traditional methods like PCA (linear) and t-SNE (nonlinear) assume data lives in a continuous Euclidean space, a poor fit for this discrete, often zero-heavy count data, leading to suboptimal or misleading low-dimensional embeddings. Because a Poisson count's variance grows with its rate, the same absolute gap in counts can be negligible at high rates yet highly significant near zero, a distinction Euclidean distance cannot express.

p-SNE is built from the ground up for this data type. It uses the Kullback-Leibler (KL) divergence between Poisson distributions to measure the dissimilarity between high-dimensional data points, a statistically better-grounded choice for count data than Euclidean distance. The algorithm then optimizes a low-dimensional embedding using the Hellinger distance to preserve these Poisson-based relationships. In tests, p-SNE recovered meaningful structure that other methods missed, such as weekly patterns in communication data, thematic clusters in academic papers, and clear stimulus-response gradients in neural recordings.
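
The paper's own code is not reproduced here, but both ingredients have closed forms when each data point is treated as a vector of independent Poisson rates (an assumption made for this sketch). A minimal NumPy illustration of the two dissimilarities might look like:

    import numpy as np

    def poisson_kl(lam_p, lam_q, eps=1e-9):
        # KL( Poisson(lam_p) || Poisson(lam_q) ), summed over independent coordinates;
        # eps smooths zero counts so the logarithm stays finite.
        lam_p = np.asarray(lam_p, dtype=float) + eps
        lam_q = np.asarray(lam_q, dtype=float) + eps
        return float(np.sum(lam_q - lam_p + lam_p * np.log(lam_p / lam_q)))

    def poisson_hellinger(lam_p, lam_q):
        # Hellinger distance between products of independent Poisson distributions.
        # The Bhattacharyya coefficient of two Poissons is exp(-(sqrt(a) - sqrt(b))**2 / 2).
        lam_p = np.asarray(lam_p, dtype=float)
        lam_q = np.asarray(lam_q, dtype=float)
        bc = np.exp(-0.5 * np.sum((np.sqrt(lam_p) - np.sqrt(lam_q)) ** 2))
        return float(np.sqrt(1.0 - bc))

    # Both pairs below differ by exactly 3 events (identical Euclidean distance),
    # yet the Poisson view treats 0-vs-3 as far more surprising than 10-vs-13,
    # because a Poisson variable's variance grows with its rate.
    print(poisson_kl([0], [3]), poisson_kl([10], [13]))   # ~3.0 vs ~0.38

Hellinger distance is bounded and symmetric (a proper metric), which makes it a convenient quantity to match on the low-dimensional side, while the KL term used on the high-dimensional side is asymmetric.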

This work provides data scientists and researchers in fields like computational neuroscience, natural language processing, and social network analysis with a purpose-built tool for exploratory data analysis. By correctly modeling the statistical nature of event-count data, p-SNE can reveal latent patterns—like temporal drift or categorical groupings—that are essential for forming hypotheses and guiding further research, moving beyond the limitations of general-purpose dimensionality reduction techniques.

Key Points
  • New 'p-SNE' algorithm designed for sparse, count-based data (e.g., word counts, neural spikes) where PCA/t-SNE fail.
  • Uses KL divergence between Poisson distributions for dissimilarity and optimizes embeddings with Hellinger distance (a rough off-the-shelf approximation is sketched after this list).
  • Demonstrated on real data: found weekday email patterns, OpenReview paper clusters, and neural stimulus gradients.
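
The authors' implementation is not reproduced here, but as a rough way to experiment with the idea using off-the-shelf tools, one can feed a symmetrized Poisson-KL dissimilarity matrix into scikit-learn's t-SNE with a precomputed metric. Note this is only an approximation of p-SNE: t-SNE keeps its own Student-t low-dimensional kernel rather than the Hellinger-based one described above, and the data below is synthetic.

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    # Synthetic sparse counts: two groups of 20 samples drawn with different Poisson rates.
    X = np.vstack([rng.poisson(0.3, size=(20, 50)),
                   rng.poisson(2.0, size=(20, 50))]).astype(float)

    L = X + 1e-9                                  # counts as smoothed rate estimates
    # Pairwise KL( Poisson(L[i]) || Poisson(L[j]) ), summed over coordinates.
    D = (L[None, :, :] - L[:, None, :]
         + L[:, None, :] * np.log(L[:, None, :] / L[None, :, :])).sum(axis=-1)
    D = 0.5 * (D + D.T)                           # symmetrize (KL itself is asymmetric)
    np.fill_diagonal(D, 0.0)

    emb = TSNE(n_components=2, metric="precomputed", init="random",
               perplexity=10, random_state=0).fit_transform(D)
    print(emb.shape)                              # (40, 2) low-dimensional coordinates

Swapping the KL matrix for plain Euclidean distances on the same counts gives a direct way to see how the choice of dissimilarity changes the embedding.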

Why It Matters

Provides a statistically principled tool for visualizing event-count data in neuroscience, NLP, and social science, revealing hidden patterns that general-purpose methods can miss.