Research & Papers

[P] PCA before truncation makes non-Matryoshka embeddings compressible: results on BGE-M3

A simple PCA rotation before truncation preserves 99.6% cosine similarity when halving 1024-dimensional BGE-M3 embeddings to 512 dimensions.

Deep Dive

A viral research post describes a simple but effective technique for compressing dense embedding models that weren't trained for dimensionality reduction, such as the popular BGE-M3. The core problem is that naively truncating the final dimensions of a standard embedding destroys its retrieval quality, because a standard model spreads information across all dimensions with no particular ordering. The proposed solution is to fit a Principal Component Analysis (PCA) model once on a sample of embeddings, rotate all vectors into the new basis, where variance is sorted in descending order, and then truncate. This concentrates the signal into the leading components, so truncation discards the least-informative directions rather than arbitrary ones.
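
In practice this amounts to a one-time fit plus a matrix multiply per vector. A minimal sketch with scikit-learn, where the random sample stands in for real BGE-M3 embeddings and re-normalizing after truncation is an assumption rather than something the post specifies:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a representative sample of real BGE-M3 embeddings, shape (n, 1024).
sample = np.random.randn(10_000, 1024).astype(np.float32)

# One-time fit: learn the full rotation; components are sorted by explained variance.
pca = PCA(n_components=1024).fit(sample)

def compress(vectors: np.ndarray, k: int = 512) -> np.ndarray:
    """Rotate into the PCA basis, keep the k leading components, re-normalize."""
    rotated = pca.transform(vectors)[:, :k]   # rotate first, then truncate
    return rotated / np.linalg.norm(rotated, axis=1, keepdims=True)

small = compress(sample[:100])                # (100, 512) compressed vectors
```

Because the full 1024-component rotation is stored, the truncation point k can be chosen (or changed) after the fact without refitting.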

Results on a 1024-dimensional BGE-M3 sample are striking: compressing to 512 dimensions with PCA-first achieved a 0.996 cosine similarity versus 0.707 for naive truncation. At 256 dimensions, the gap widened to 0.974 vs. 0.467. The method also combines effectively with quantization. Applying 3-bit quantization after PCA compression to 384 dimensions achieved a 27.7x compression ratio with a 0.979 cosine similarity, creating a practical middle ground between high-quality scalar quantization and aggressive binary methods.
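
A hedged sketch of how the quantization step might look: uniform per-dimension 3-bit scalar quantization applied to the PCA-truncated vectors. The post does not publish its exact scheme, so the range calibration here is an assumption:

```python
import numpy as np

def quantize_3bit(x: np.ndarray):
    """Uniform per-dimension 3-bit quantization: 8 levels over each dimension's range."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    scale = (hi - lo) / 7.0 + 1e-12                      # 8 levels -> 7 steps; epsilon avoids /0
    codes = np.round((x - lo) / scale).astype(np.uint8)  # integer codes in [0, 7]
    return codes, lo, scale

def dequantize(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale + lo
```

For the arithmetic: 1024 float32 dimensions take 4096 bytes, while 384 dimensions at 3 bits bit-pack into 144 bytes, roughly a 28x reduction, in line with the reported 27.7x once the stored per-dimension scale parameters are counted. (The sketch keeps one code per uint8 for clarity; a real deployment would bit-pack.)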

The analysis highlights a crucial caveat for real-world use: while cosine similarity remains high even under aggressive compression, task-specific metrics like Recall@10 degrade more quickly. For the 27.7x setup, Recall@10 dropped to 76.4%, so compression should be tuned to the end application's priority: similarity preservation versus top-tier retrieval accuracy. This work provides an immediately applicable, low-overhead method for developers who need to deploy large embedding models in resource-constrained environments.
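
To make that trade-off concrete, Recall@10 can be measured as the overlap between brute-force top-10 results from the original vectors and from the compressed ones. A minimal sketch (the function name and brute-force setup are illustrative, not from the post):

```python
import numpy as np

def recall_at_k(db_full, db_comp, q_full, q_comp, k: int = 10) -> float:
    """Fraction of each query's full-precision top-k that compressed search recovers."""
    # Assumes all vectors are L2-normalized, so dot product == cosine similarity.
    truth = np.argsort(-(q_full @ db_full.T), axis=1)[:, :k]
    approx = np.argsort(-(q_comp @ db_comp.T), axis=1)[:, :k]
    overlap = [len(set(t) & set(a)) for t, a in zip(truth, approx)]
    return float(np.mean(overlap)) / k
```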

Key Points
  • PCA rotation before truncation preserved 0.990 cosine similarity at 384 dimensions vs. 0.609 for naive truncation on BGE-M3.
  • Combining PCA with 3-bit quantization achieved 27.7x compression with a 0.979 cosine similarity, though Recall@10 dropped to 76.4%.
  • The method is a one-time preprocessing step, making it viable for compressing existing, non-Matryoshka-trained embedding models in production.

Why It Matters

Enables massive storage and cost savings for AI applications using embeddings, making advanced retrieval viable on smaller devices and budgets.