Research & Papers

Controlled Paraphrase Geometry in Sentence Embedding Space: Local Manifold Modeling and Latent Probing

300K controlled paraphrases mapped in embedding space using local manifold modeling.

Deep Dive

A new paper by Leonid Bedratyuk tackles the underexplored local geometry of sentence embeddings: how semantically close sentences (paraphrases) cluster in high-dimensional space. The study introduces CoPaGE-300K, a dataset of 300,000 controlled, template-based paraphrases with slot-level annotations and precomputed embeddings. By fitting affine, quadratic, and cubic models to paraphrase neighborhoods, Bedratyuk shows that nonlinear local manifolds describe those neighborhoods far better than linear approximations.
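The affine-versus-nonlinear comparison can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's code or the CoPaGE-300K dataset: we build a toy point cloud on a curved 2-D surface in 10-D and compare least-squares residuals of degree-1 (affine) and degree-2 (quadratic) local fits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a paraphrase neighborhood: points on a curved
# 2-D surface embedded in 10-D, plus small isotropic noise.
n, d = 200, 10
u = rng.uniform(-1, 1, size=(n, 2))                # local chart coordinates
basis = np.linalg.qr(rng.normal(size=(d, 3)))[0]   # orthonormal directions
cloud = (u @ basis[:, :2].T                                       # linear part
         + (u[:, 0] ** 2 + u[:, 1] ** 2)[:, None] * basis[:, 2]   # curvature
         + 0.01 * rng.normal(size=(n, d)))                        # noise

def design(u, degree):
    """Polynomial design matrix in the two local coordinates."""
    cols = [np.ones(len(u)), u[:, 0], u[:, 1]]
    if degree >= 2:
        cols += [u[:, 0] ** 2, u[:, 0] * u[:, 1], u[:, 1] ** 2]
    return np.stack(cols, axis=1)

def fit_residual(u, cloud, degree):
    """RMS residual of a least-squares polynomial fit to the cloud."""
    X = design(u, degree)
    coef, *_ = np.linalg.lstsq(X, cloud, rcond=None)
    return np.sqrt(np.mean((cloud - X @ coef) ** 2))

affine_err = fit_residual(u, cloud, degree=1)
quad_err = fit_residual(u, cloud, degree=2)
print(f"affine RMS residual:    {affine_err:.4f}")
print(f"quadratic RMS residual: {quad_err:.4f}")
```

On data with genuine curvature, the quadratic residual collapses to the noise floor while the affine residual cannot, which is the qualitative pattern the paper reports for real paraphrase neighborhoods.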

Beyond modeling, the paper proposes a surface-based latent probing method that generates synthetic points in PCA-reduced space while preserving fitted surface consistency, Hessian shape, and coefficient stability. Surprisingly, downstream classification experiments reveal that geometrically valid synthetic points do not automatically improve task performance. This result forces a distinction between geometric validity and discriminative utility, making the work a key reference for anyone building or analyzing sentence embedding systems.
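One plausible reading of the probing idea can be sketched on toy data. The fragment below is an assumption-laden illustration, not the paper's procedure (the Hessian-shape and coefficient-stability checks are omitted): reduce the cloud with PCA, fit a quadratic surface in the reduced space, and generate a synthetic point constrained to lie on that fitted surface before mapping it back to the full embedding space.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy embedding cloud (stand-in for real paraphrase embeddings).
n, d = 300, 16
u = rng.uniform(-1, 1, size=(n, 2))
basis = np.linalg.qr(rng.normal(size=(d, 3)))[0]
emb = (u @ basis[:, :2].T
       + (u[:, 0] * u[:, 1])[:, None] * basis[:, 2]
       + 0.01 * rng.normal(size=(n, d)))

# PCA via SVD: keep the top 3 components.
mean = emb.mean(axis=0)
_, _, Vt = np.linalg.svd(emb - mean, full_matrices=False)
Z = (emb - mean) @ Vt[:3].T        # coordinates in reduced PCA space

# Fit a quadratic surface z3 = f(z1, z2) in the reduced space.
z12, z3 = Z[:, :2], Z[:, 2]
X = np.stack([np.ones(n), z12[:, 0], z12[:, 1],
              z12[:, 0] ** 2, z12[:, 0] * z12[:, 1], z12[:, 1] ** 2], axis=1)
coef, *_ = np.linalg.lstsq(X, z3, rcond=None)

def probe(z1, z2):
    """Synthetic point constrained to the fitted surface, mapped back
    from PCA space to the full embedding space."""
    x = np.array([1.0, z1, z2, z1 ** 2, z1 * z2, z2 ** 2])
    z = np.array([z1, z2, x @ coef])
    return mean + z @ Vt[:3]

synthetic = probe(0.3, -0.2)
print(synthetic.shape)  # (16,)
```

A point generated this way is geometrically valid by construction (it lies on the fitted local surface), which makes the paper's negative result concrete: such points carry no new label information, so adding them need not help a downstream classifier.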

Key Points
  • CoPaGE-300K contains 300,000 controlled paraphrase variants with slot-level annotations and precomputed embeddings.
  • Nonlinear local models (quadratic, cubic) describe sentence embedding clouds significantly better than affine models.
  • Surface-based latent probing achieves high geometric fidelity but does not directly translate to better classification performance.

Why It Matters

The result challenges the assumption that better local embedding geometry alone improves downstream NLP performance, with direct implications for how sentence representations are analyzed and augmented.