LLM Bootcamp - Module 3 - Evolution of Embeddings
In this module, we will explore the evolution of embeddings and how they have transformed the way we represent and work with textual data. From simple one-hot encoding to modern semantic embeddings, the methods used to convert text into numerical vectors have evolved significantly. We will cover the core techniques that have been developed, focusing on how they contribute to understanding and processing language in Natural Language Processing (NLP) tasks.
1. Review of Classical Techniques
In the early days of NLP and machine learning, text was represented in simple, often sparse, forms that lacked the ability to capture rich contextual meanings. These classical techniques served as the foundational building blocks for modern embedding methods.
1.1. Binary/One-Hot Encoding
One of the most basic methods for encoding text data is binary/one-hot encoding. In this technique, each unique word in the vocabulary is represented by a binary vector. The vector is as long as the vocabulary size, and only one element is set to "1" (the position corresponding to the word), with all others set to "0."
Advantages: Simple and easy to implement.
Limitations:
Very sparse representation.
Fails to capture relationships or semantics between words (e.g., "king" and "queen" are represented by orthogonal vectors that encode no notion of their similarity).
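To make this concrete, here is a minimal sketch (using NumPy, with an illustrative two-sentence corpus) that builds one-hot vectors over a toy vocabulary:

```python
import numpy as np

# Illustrative corpus and vocabulary
corpus = ["the king rules", "the queen rules"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |vocab| with a 1 at the word's position."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(vocab)             # ['king', 'queen', 'rules', 'the']
print(one_hot("king"))   # [1. 0. 0. 0.]
print(one_hot("queen"))  # [0. 1. 0. 0.] -- orthogonal to "king", so no similarity is captured
```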
1.2. Bag-of-Words (BoW)
The bag-of-words (BoW) model represents text as a collection of words, ignoring grammar and word order but maintaining multiplicity. A vector is created by counting the frequency of each word in the text corpus.
Advantages: Straightforward, easy to understand and implement.
Limitations:
Loses the context of word order.
High dimensionality due to the vocabulary size, often resulting in sparse vectors.
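A hedged sketch of BoW vectors, assuming scikit-learn's CountVectorizer and an illustrative corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus
corpus = [
    "the dog chased the cat",
    "the cat sat on the mat",
]

vectorizer = CountVectorizer()          # lowercases and tokenizes by default
bow = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one row of raw counts per document; word order is lost
```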
1.3. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is an improvement over BoW: instead of raw counts, each term's frequency is weighted by how rare the term is across all documents, capturing the importance of a word within a particular document relative to the entire corpus.
Term Frequency (TF): Measures how frequently a term occurs in a document.
Inverse Document Frequency (IDF): Weighs down the terms that are common across all documents.
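A common formulation (the exact weighting varies by implementation; scikit-learn, for example, adds smoothing) is:

TF-IDF(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.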
Advantages:
Reduces the impact of common words like “the”, “and”, etc.
Provides more meaningful vector representations of documents.
Limitations:
Still ignores word context and relationships.
Key Takeaways:
Classical techniques like one-hot encoding, BoW, and TF-IDF have provided the foundation for text vectorization.
They are efficient for simple tasks but are limited in capturing the meaning, context, and relationships in text.
2. Capturing Local Context with N-Grams and Challenges
Beyond these simple techniques, n-grams can be used to capture some level of local context. An n-gram is a contiguous sequence of n items (words or characters) from a given text.
2.1. What are N-Grams?
Unigrams: Single words (e.g., "dog", "cat").
Bigrams: Pairs of consecutive words (e.g., "big dog", "smart cat").
Trigrams: Triplets of consecutive words (e.g., "a big dog").
N-grams allow us to capture local context by considering pairs or triples of words, offering a more nuanced representation than single words.
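As a sketch (again assuming scikit-learn), the same count-based vectorizer can produce n-gram features by widening its n-gram range:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the smart cat sat", "the big dog ran"]

# ngram_range=(1, 2) keeps unigrams and adds bigrams as extra features
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# unigrams such as 'smart' plus bigrams such as 'smart cat';
# the feature list grows quickly as n increases
```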
2.2. Challenges of N-Grams
Increased Dimensionality: As n grows, the number of distinct n-grams explodes, so vectors become even sparser and higher-dimensional.
Limited Context: N-grams still capture only local context (e.g., the bigram "smart cat" reflects nothing beyond two adjacent words).
Key Takeaways:
N-grams offer a way to capture local context but still have limitations in scalability and capturing longer-range dependencies between words.
3. Semantic Encoding Techniques
With the limitations of classical methods, we turn to more advanced techniques that capture semantic meaning—the true meaning behind words based on their context.
3.1. Word2Vec and Dense Word Embeddings
One of the most influential developments in embedding techniques is Word2Vec, which uses dense embeddings. Unlike one-hot encoding, where each word is a sparse vector with a single nonzero entry, Word2Vec learns a dense, fixed-length vector for each word, so that words used in similar contexts end up with similar vector representations.
Word2Vec has two architectures:
Continuous Bag of Words (CBOW): Predicts the target word based on surrounding context.
Skip-gram: Predicts surrounding words based on a target word.
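A minimal training sketch, assuming the gensim library and a tiny tokenized corpus purely for illustration (real models are trained on very large corpora):

```python
from gensim.models import Word2Vec

# Tiny tokenized corpus, illustrative only -- real training needs far more text
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=1 selects skip-gram; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)                # (50,) -- a dense vector, not a sparse one-hot
print(model.wv.similarity("king", "queen"))  # cosine similarity; not meaningful on a toy corpus,
                                             # but the API is identical at scale
```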
3.2. Application of Word2Vec in Text Analytics and NLP Tasks
Text Classification: Categorizing documents into predefined labels.
Named Entity Recognition (NER): Identifying entities like person names, dates, locations, etc.
Machine Translation: Translating sentences between languages.
Word2Vec significantly improves the ability to understand word meanings and semantic relationships by learning from context and usage.
Key Takeaways:
Word2Vec captures word semantics by representing similar words in similar vector spaces.
It has revolutionized text analytics by enabling more accurate NLP tasks.
4. Text Embeddings
As we continue to move toward more powerful models, we can extend embeddings to represent entire texts, not just individual words. This section covers how we can create sentence-level and document-level embeddings.
4.1. Word and Sentence Embeddings
Word Embeddings: Represent individual words, as seen in Word2Vec and other models like GloVe.
Sentence Embeddings: Represent entire sentences, either by aggregating word embeddings (e.g., averaging them) or by using models like BERT (Bidirectional Encoder Representations from Transformers), which encode each word in the context of the whole sentence.
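As an illustration, a sentence-embedding library such as sentence-transformers (assumed installed here, with "all-MiniLM-L6-v2" as an example checkpoint) maps each sentence to a single fixed-length vector:

```python
from sentence_transformers import SentenceTransformer

# 'all-MiniLM-L6-v2' is one commonly used checkpoint; any sentence-embedding model works
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
]
embeddings = model.encode(sentences)  # one fixed-length vector per sentence
print(embeddings.shape)               # (2, embedding_dim)
```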
4.2. Text Similarity Measures
Once we have embeddings, we can measure how similar two texts are. Common methods include:
Dot Product: Takes the dot product of the two vectors; larger values indicate greater similarity, but the score is also sensitive to vector magnitude.
Cosine Similarity: Measures the cosine of the angle between two vectors. A value close to 1 indicates high similarity, a value near 0 indicates little similarity, and negative values indicate opposing directions.
Inner Product: A generalization of the dot product; for real-valued embeddings the two coincide, and the term is common in deep learning and vector-search settings.
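Both measures are easy to compute directly; here is a small NumPy sketch with illustrative vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

dot = np.dot(a, b)                                      # unnormalized; sensitive to magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # normalized to the range [-1, 1]

print(dot)     # 28.0
print(cosine)  # 1.0 -- b is a scaled copy of a, so the angle between them is zero
```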
Key Takeaways:
Text embeddings extend the power of word embeddings to represent entire texts (sentences, paragraphs, etc.).
Text similarity measures like cosine similarity and dot product are commonly used to quantify the closeness between two embeddings.
5. Hands-on Exercise
5.1. Creating TF-IDF Embeddings on a Document Corpus
In this exercise, learners will use the TF-IDF method to create document embeddings. The goal is to apply TF-IDF to a small document corpus and generate vectors that reflect the importance of terms within each document.
Steps:
Preprocess your document corpus (e.g., tokenization, removing stopwords).
Calculate the TF-IDF score for each term in each document.
Create vectors that represent the document by its term weights.
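A possible starting point, assuming scikit-learn and using a placeholder corpus that learners would replace with their own documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus -- replace with your own documents
documents = [
    "Machine learning models learn patterns from data.",
    "Deep learning is a subset of machine learning.",
    "Embeddings turn text into numerical vectors.",
]

# stop_words='english' removes common words; tokenization and lowercasing are built in
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)  # one row of term weights per document

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```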
5.2. Calculating Similarity Between Sentences Using Cosine Similarity and Dot Product
Once embeddings are created, learners will calculate the similarity between two or more sentences using the cosine similarity and dot product techniques.
Steps:
Compute embeddings for each sentence in the dataset.
Use cosine similarity to compare sentence pairs.
Use the dot product method to compare sentence vectors.
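One way to complete this exercise, reusing TF-IDF vectors as simple sentence embeddings (any embedding method plugs in the same way), again assuming scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock prices fell sharply today.",
]

# TF-IDF vectors stand in for sentence embeddings here
embeddings = TfidfVectorizer().fit_transform(sentences).toarray()

# Pairwise cosine similarities: values near 1 mean the sentences share heavily weighted terms
print(cosine_similarity(embeddings).round(2))

# Pairwise dot products: unnormalized, so vector magnitudes also affect the score
print((embeddings @ embeddings.T).round(2))
```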
Key Takeaways:
The exercise allows learners to put the theory into practice by creating their own embeddings and measuring similarity.
Conclusion
The evolution of embeddings has transformed the field of Natural Language Processing. We began with simple vectorization methods like one-hot encoding and TF-IDF, which are limited in capturing meaning. With Word2Vec and other semantic techniques, embeddings evolved to capture the meaning of words from the contexts in which they appear. With modern methods like sentence embeddings and BERT, we can now represent entire sentences and documents, allowing for more nuanced understanding and analysis. By completing the hands-on exercises, learners can practice applying these techniques to real-world datasets and improve their text analytics workflows.