The Core Idea
In 2013, Google researchers published Word2Vec, a technique that represents each word as a list of numbers (a “vector”) in a way that captures its meaning. The principle echoes the linguist J.R. Firth’s maxim: “You shall know a word by the company it keeps.” Words that appear in similar contexts get similar vectors. “King” and “queen” end up close together in this mathematical space; “king” and “banana” end up far apart.
The Famous Example
Word2Vec produced a result that stunned the AI community:
King − Man + Woman = Queen
The arithmetic holds only approximately — the result vector’s nearest neighbor in the space is “queen” — but the pattern generalizes: “Paris − France + Italy ≈ Rome.” “Bigger − Big + Small ≈ Smaller.” The mathematical relationships between word vectors captured real semantic relationships. For the first time, a machine appeared to have captured something resembling the meaning of words — and it learned that structure from nothing but their statistical co-occurrence.
Why It Changed Everything
Embeddings solved the synonym problem. A search for “automobile insurance” now returns results about “car insurance” because the system knows these words are semantically close. This is the foundation of semantic search — searching by meaning rather than exact keywords. It’s also the foundation for modern recommendation systems, document clustering, and the vector databases that power RAG systems (Chapter 18).
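A minimal sketch of semantic search, assuming toy document embeddings (the values below are hypothetical; in practice both the query and the documents would be embedded by a trained model). Documents are ranked by cosine similarity to the query vector, so a query about “automobile insurance” surfaces the car-insurance documents even though the keywords differ:

```python
import math

# Hypothetical document embeddings (illustration only -- real systems
# would produce these with an embedding model and store them in a
# vector database).
doc_vectors = {
    "car insurance quotes":      [0.90, 0.80, 0.10],
    "automobile coverage plans": [0.85, 0.75, 0.15],
    "banana bread recipe":       [0.05, 0.10, 0.95],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vector, docs):
    """Rank documents by cosine similarity to the query vector."""
    return sorted(docs, key=lambda d: cosine(query_vector, docs[d]),
                  reverse=True)

# Stand-in embedding for the query "automobile insurance".
query = [0.88, 0.78, 0.12]
for title in semantic_search(query, doc_vectors):
    print(title)  # insurance documents rank above the recipe
```

Production systems use the same ranking idea, replacing the linear scan with an approximate nearest-neighbor index so it scales to millions of documents.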
Key insight: Embeddings are arguably the single most important concept in modern AI. They convert any type of data — words, sentences, images, products, customers — into numerical vectors that capture meaning. When someone says “vector database” or “semantic search,” they’re talking about systems built on embeddings. This concept underpins everything from ChatGPT to enterprise knowledge management.