Word2Vec's output weights become word vectors: the hidden math explained
Why do output-layer weights encode semantic features rather than serving only as prediction parameters?
Deep Dive
A Reddit user asks for an intuitive and mathematical explanation of why the hidden-to-output weight matrix in Word2Vec (CBOW or Skip-gram) learns meaningful word embeddings, as most resources state that the weights become embeddings but do not explain why. The user has explored multiple videos, blog posts, and ChatGPT without finding an explanation that clicks.
Key Points
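- Word2Vec's output weights, multiplied with the input (hidden-layer) vectors, implicitly factorize a word-context co-occurrence matrix into low-dimensional vectors; for skip-gram with negative sampling, that matrix is a shifted pointwise mutual information (PMI) matrix.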
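- Backpropagation through the hidden-to-output layer moves each output vector toward the hidden vectors of the words it actually co-occurs with and away from the others, so words that appear in similar contexts end up with similar vectors.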
- This mechanism explains why embeddings from trained Word2Vec models outperform randomly initialized weights on semantic tasks such as word similarity.
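The update rule behind these points can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference Word2Vec implementation: it assumes skip-gram with a full softmax (no negative sampling or hierarchical softmax), a made-up four-sentence corpus, and hypothetical names such as W_in, W_out, and sgd_step.

```python
import numpy as np

# Toy corpus (made up for illustration): "cat"/"dog" share contexts, as do "car"/"bus".
corpus = ["cat eats food", "dog eats food", "car needs fuel", "bus needs fuel"]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# (center, context) index pairs using a window of one word on each side.
pairs = [(idx[s[i]], idx[s[j]])
         for s in sentences
         for i in range(len(s))
         for j in (i - 1, i + 1)
         if 0 <= j < len(s)]

rng = np.random.default_rng(0)
V, D = len(vocab), 8
W_in = rng.normal(scale=0.1, size=(V, D))   # input-to-hidden: row c is the hidden vector h
W_out = rng.normal(scale=0.1, size=(V, D))  # hidden-to-output: row j is the output vector u_j


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def mean_loss(W_in, W_out):
    """Average of -log p(context | center) over all training pairs."""
    return float(np.mean([-np.log(softmax(W_out @ W_in[c])[o]) for c, o in pairs]))


def sgd_step(W_in, W_out, c, o, lr=0.1):
    """One SGD step on the loss -log p(o | c), updating both matrices in place."""
    h = W_in[c].copy()
    err = softmax(W_out @ h)   # predicted distribution p over the vocabulary
    err[o] -= 1.0              # p - y, where y is one-hot at the observed context word
    grad_h = W_out.T @ err     # dL/dh, computed before W_out is modified
    # dL/du_j = (p_j - y_j) * h: u_o moves toward h, every other u_j moves away from it.
    W_out -= lr * np.outer(err, h)
    W_in[c] -= lr * grad_h


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


loss_before = mean_loss(W_in, W_out)
for _ in range(300):
    for c, o in pairs:
        sgd_step(W_in, W_out, c, o)
loss_after = mean_loss(W_in, W_out)

u = {w: W_out[idx[w]] for w in vocab}  # output vectors, one per word
print("mean loss:", loss_before, "->", loss_after)
print("cos(cat, dog):", cosine(u["cat"], u["dog"]))
print("cos(cat, car):", cosine(u["cat"], u["car"]))
```

On a corpus this small, one would expect the output vectors of cat and dog, which are both predicted from eats, to end up closer to each other than to car. The exact numbers depend on the seed, the learning rate, and the number of passes.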
Why It Matters
Understanding why Word2Vec's weights become useful word vectors helps practitioners tune embeddings and apply them to NLP tasks more effectively.