FastText embeddings in NLP - Model Metrics & Evaluation
FastText embeddings represent words as vectors that capture meaning. A common way to check their quality is cosine similarity, which measures the angle between two word vectors: the closer the value is to 1, the more related the words are in the embedding space. For downstream tasks that use FastText, such as text classification, we also check accuracy or the F1 score to see how well the model performs with these embeddings.
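As a minimal sketch of the cosine similarity check, the snippet below computes it with the standard library only. The three vectors are invented for illustration and are not real FastText output; real embeddings typically have 100 to 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, invented for illustration only.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.95]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, car)
print(f"cat vs kitten: {sim_related:.3f}")
print(f"cat vs car:    {sim_unrelated:.3f}")
```

With a trained model, the same comparison would use the vectors the model returns for each word instead of hand-written lists.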
| | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP) = 80 | False Negative (FN) = 20 |
| Actual Negative | False Positive (FP) = 10 | True Negative (TN) = 90 |
Total samples = 80 + 20 + 10 + 90 = 200
From this matrix, we calculate:
- Precision = TP / (TP + FP) = 80 / (80 + 10) ≈ 0.89
- Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.84
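The arithmetic above can be reproduced in a few lines. This sketch uses the counts from the confusion matrix and also derives accuracy, which the matrix implies but the list does not state.

```python
# Counts taken from the confusion matrix above.
tp, fn, fp, tn = 80, 20, 10, 90

total = tp + fn + fp + tn
accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total} accuracy={accuracy:.2f}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```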
Imagine a spam detector using FastText embeddings:
- High Precision: Few good emails are wrongly marked as spam. Users don't miss important emails.
- High Recall: Most spam emails are caught. Less spam reaches the inbox.
Depending on what matters more, we adjust the model. For spam, high precision is often preferred to avoid losing good emails. For medical text classification, high recall is critical to catch all important cases.
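One common way to adjust the model toward precision or recall is to move the decision threshold on its predicted scores. The sketch below assumes a classifier that outputs a spam score between 0 and 1; the scores and labels are hypothetical and chosen only to show the trade-off.

```python
# Hypothetical (score, true_label) pairs; label 1 means spam.
scored = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 1),
    (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0),
]


def precision_recall(threshold):
    """Compute precision and recall when scores >= threshold count as spam."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.5, 0.85):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold trades recall for precision, which matches the spam case where losing a good email is the costlier mistake.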
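Good metrics suggest the embeddings help the model understand text well. The thresholds below are rough rules of thumb; the right values depend on the task and the dataset: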
- Good: Accuracy > 85%, F1 score > 0.8, cosine similarity between related words > 0.7
- Bad: Accuracy < 60%, F1 score < 0.5, cosine similarity between related words < 0.3
Bad values suggest embeddings do not capture meaning well or the model is not learning properly.
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced.
- Data leakage: Using test data during training artificially inflates metrics.
- Overfitting: Very high training accuracy but low test accuracy means the model memorizes instead of generalizing.
- Ignoring semantic similarity: Only checking classification metrics misses how well embeddings capture word meaning.
Your text classification model using FastText embeddings has 98% accuracy but only 12% recall on the positive class. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most positive cases, which can be critical depending on the task. High accuracy alone is misleading if the positive class is rare.
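To see how 98% accuracy and 12% recall can coexist, here is a sketch with hypothetical counts: 10,000 samples of which only 200 are positive. The numbers are invented to match the percentages in the question, not taken from a real model.

```python
# Hypothetical counts chosen to match 98% accuracy and 12% recall.
tp, fn, fp, tn = 24, 176, 24, 9776

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# A trivial baseline that always predicts "negative" gets every
# actual negative right and every actual positive wrong.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
print(f"always-negative baseline accuracy={baseline_accuracy:.2f}")
```

In this setup a model that never predicts the positive class reaches the same accuracy, which is why accuracy alone says little when the positive class is rare.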
Practice
Solution
Step 1: Understand FastText's approach to word representation
FastText breaks words into smaller parts called character n-grams, which helps it learn better representations for rare or misspelled words.
Step 2: Compare with traditional embeddings
Traditional embeddings like Word2Vec treat words as whole units and cannot handle unseen or misspelled words well.
Final Answer:
It considers subword information to handle rare or misspelled words. -> Option A
Quick Check:
FastText uses subwords = A [OK]
- Thinking FastText ignores subwords
- Confusing FastText with image embeddings
- Assuming FastText stores full sentences
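The subword idea can be illustrated by extracting character n-grams by hand. This is a simplified sketch, not the library's implementation: it wraps the word in the boundary markers < and > and uses n-gram lengths 3 to 6, which are the usual defaults; the function name is hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))

# A misspelling still shares many n-grams with the correct word.
shared = set(char_ngrams("running")) & set(char_ngrams("runing"))
print(sorted(shared))
```

Because "runing" and "running" overlap in n-grams such as "run" and "ing", their vectors are built from partly the same pieces.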
Solution
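Step 1: Identify the correct Gensim function for FastText pretrained vectors
KeyedVectors.load_word2vec_format reads files in word2vec format, with binary=True for the binary variant. This is the option the quiz marks as correct. Note, however, that Facebook's native FastText .bin files such as cc.en.300.bin use a different format: in Gensim 4.x they are loaded with gensim.models.fasttext.load_facebook_vectors or load_facebook_model, while the text .vec files are loaded with load_word2vec_format and binary=False.
Step 2: Check other options for correctness
model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin') relies on a method that older Gensim releases marked as deprecated and that is not available in Gensim 4.x. model = gensim.models.Word2Vec.load('cc.en.300.bin') loads models saved with Gensim's own save(), not FastText files. model = gensim.load('fasttext_model.bin') fails because Gensim has no top-level load function.
Final Answer:
model = gensim.models.KeyedVectors.load_word2vec_format('cc.en.300.bin', binary=True) -> Option D
Quick Check:
Use KeyedVectors.load_word2vec_format for word2vec-format files and load_facebook_vectors for native FastText .bin files [OK]
- Using Word2Vec.load for FastText files
- Calling load_fasttext_format, which is not available in Gensim 4.x
- Forgetting binary=True for binary word2vec-format files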
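The loading options can be sketched as two small helpers, assuming Gensim 4.x is installed. The helper names are hypothetical, and the paths are whatever files you have downloaded; nothing is loaded until a helper is called.

```python
def load_native_fasttext_vectors(path):
    """Load Facebook's native FastText binary (for example a cc.en.300.bin file).

    The result keeps subword information, so it can build vectors
    for words that were not in the training vocabulary.
    """
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(path)


def load_vec_text_file(path):
    """Load a FastText .vec file, which is stored in word2vec text format.

    The result holds whole-word vectors only, so unseen words raise KeyError.
    """
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=False)
```

Which helper fits depends on the file: the native .bin file is larger but handles unseen words, while the .vec file is a plain lookup table.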
    from gensim.models import FastText

    sentences = [['cat', 'sat', 'on', 'mat'], ['dog', 'barked']]
    model = FastText(sentences, vector_size=10, window=3, min_count=1, epochs=5)
    print(model.wv['cat'])

What will be the output type of model.wv['cat']?
Solution
Step 1: Understand what model.wv['word'] returns in Gensim FastText
model.wv['cat'] returns the embedding of the word 'cat' as a numpy array. With vector_size=10, the array has 10 elements.
Step 2: Check other options for output type
A list of words similar to 'cat' would come from a similarity query such as model.wv.most_similar('cat'), not from indexing. An integer representing the frequency of 'cat' is a count, which indexing does not return. A string with the word 'cat' is just the key, not the vector.
Final Answer:
A numpy array representing the vector embedding of 'cat' -> Option A
Quick Check:
model.wv['word'] returns vector array [OK]
- Expecting a list of similar words instead of vector
- Thinking it returns frequency count
- Confusing word string with vector
Solution
Step 1: Understand FastText's ability with unseen words
FastText can generate vectors for unseen words by using subword information, unlike Word2Vec.
Step 2: Identify cause of KeyError
If you get a KeyError for unseen words, you most likely trained or loaded a Word2Vec model, not FastText. The same happens when FastText vectors are loaded as plain word2vec-format KeyedVectors, because that keeps only whole-word vectors and drops the subword information.
Final Answer:
You used Word2Vec instead of FastText; switch to FastText to handle unseen words. -> Option B
Quick Check:
Use FastText (not Word2Vec) for unseen words [OK]
- Assuming FastText can't handle unseen words
- Trying to fix by increasing epochs only
- Ignoring model type mismatch
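The mechanism behind vectors for unseen words can be sketched as a toy: hash each character n-gram into a bucket, look up that bucket's vector, and average the results. Everything here is an assumption made for illustration: the bucket count, the dimension, the FNV-1a-style hash, and above all the random stand-in vectors, which replace the trained n-gram vectors a real model would hold.

```python
import random

DIM = 8             # toy dimension; real models typically use 100 to 300
NUM_BUCKETS = 1000  # toy bucket count; real models use far more


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def fnv1a(text):
    """32-bit FNV-1a-style hash, used here so bucket ids are deterministic."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def bucket_vector(bucket):
    """Stand-in for a trained n-gram vector: random but fixed per bucket."""
    rng = random.Random(bucket)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def oov_vector(word):
    """Build a vector for any word by averaging its n-gram bucket vectors."""
    grams = char_ngrams(word)
    if not grams:
        raise ValueError("word is too short to produce any n-grams")
    vectors = [bucket_vector(fnv1a(g) % NUM_BUCKETS) for g in grams]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(oov_vector("runing"))
```

Because the vector is assembled from n-grams rather than looked up by whole word, no KeyError arises for a word the model never saw, which is the behavior a plain Word2Vec lookup table cannot offer.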
Solution
Step 1: Identify how FastText handles misspelled words
FastText uses subword (character n-gram) information, so it can create embeddings for misspelled or rare words.
Step 2: Choose the best approach to leverage this feature
Training FastText on your dataset with subword info enabled and using its vectors as features helps the model understand misspellings better.
Final Answer:
Train FastText on your dataset with subword information enabled and use its vectors as input features. -> Option C
Quick Check:
Train FastText with subwords for misspellings [OK]
- Using Word2Vec ignoring misspellings
- Replacing misspellings with tokens loses info
- Using one-hot encoding loses semantic info
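One simple way to use word vectors as input features is to average the vectors of a document's tokens into a single fixed-length vector. The sketch below assumes that approach; the lookup function and the 2-dimensional vectors are hypothetical stand-ins for a trained FastText model's word lookup.

```python
def document_vector(tokens, lookup, dim):
    """Average the token vectors of a document into one feature vector."""
    vectors = [lookup(token) for token in tokens]
    if not vectors:
        return [0.0] * dim  # empty document: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional vectors standing in for a trained model.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}


def toy_lookup(token):
    """Stand-in for a model lookup; unknown tokens map to a zero vector here."""
    return toy_vectors.get(token, [0.0, 0.0])


features = document_vector(["good", "movie"], toy_lookup, dim=2)
print(features)
```

With a real FastText model the lookup would also return a subword-based vector for a misspelled token, so the averaged features degrade less than they would with a whole-word lookup table.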
