
FastText embeddings in NLP - ML Experiment: Train & Evaluate

Experiment - FastText embeddings
Problem: You want to train FastText word embeddings that capture word meaning, including subword information. The current model was trained on a small dataset, and its embeddings generalize poorly to unseen words.
Current Metrics: On a word similarity task, the current FastText embeddings achieve a Spearman correlation of 0.55.
Issue: The embeddings overfit the small training set and represent rare or unseen words poorly, which hurts generalization.
Your Task
Improve the FastText embeddings so that the Spearman correlation on the word similarity task increases to at least 0.70, showing better generalization to unseen words.
You can only change FastText training hyperparameters and training data size.
You cannot switch to a different embedding model.
You must keep the training code runnable with the gensim library.
Solution
from gensim.models import FastText

# Sample larger training data
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['fasttext', 'embeddings', 'capture', 'subword', 'information'],
    ['word', 'vectors', 'help', 'in', 'nlp'],
    ['deep', 'learning', 'models', 'require', 'lots', 'of', 'data'],
    ['natural', 'language', 'processing', 'is', 'exciting'],
    ['fasttext', 'can', 'handle', 'rare', 'words'],
    ['subword', 'information', 'improves', 'embedding', 'quality'],
    ['more', 'data', 'helps', 'reduce', 'overfitting'],
    ['training', 'for', 'more', 'epochs', 'improves', 'results'],
    ['gensim', 'makes', 'training', 'fasttext', 'easy']
]

# Train FastText model with adjusted hyperparameters
model = FastText(
    sentences,
    vector_size=50,      # embedding dimension (gensim's default is 100)
    window=3,            # context window size
    min_count=1,         # include all words
    sg=1,                # skip-gram model
    epochs=20,           # more training epochs (gensim's default is 5)
    min_n=3,             # min length of char n-grams (gensim's default)
    max_n=6              # max length of char n-grams (gensim's default)
)

# Example: get vector for a word
vector = model.wv['fasttext']

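# Evaluate on a small word similarity set
# (word pairs with illustrative human scores; 'model' does not appear in the
# training sentences, so FastText builds its vector from character n-grams)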
word_pairs = [('machine', 'learning'), ('fasttext', 'embedding'), ('deep', 'model'), ('rare', 'words')]
human_scores = [0.9, 0.8, 0.7, 0.6]

from scipy.stats import spearmanr

model_scores = []
for w1, w2 in word_pairs:
    v1 = model.wv[w1]
    v2 = model.wv[w2]
    sim = v1.dot(v2) / ((v1**2).sum()**0.5 * (v2**2).sum()**0.5)
    model_scores.append(sim)

correlation, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.2f}")
Increased training data size by adding more example sentences.
Set the vector size to 50. Note that gensim's default is 100, so this is a smaller dimension, chosen here for a very small corpus.
Set min_count=1 to include all words, even rare ones.
Used the skip-gram model (sg=1), which often represents rare words better than CBOW.
Increased training epochs from gensim's default of 5 to 20 for better convergence.
Set the subword n-gram range explicitly to min_n=3 and max_n=6 (also gensim's defaults) to capture subword information.
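Before reading the score, it can help to check which evaluation words the model never saw as whole words, since those are the ones that rely on subword information. The sketch below is plain Python and reuses the sentences and word pairs from the solution; the variable names `vocabulary` and `unseen` are only illustrative.

```python
# Training sentences and evaluation pairs copied from the solution.
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['fasttext', 'embeddings', 'capture', 'subword', 'information'],
    ['word', 'vectors', 'help', 'in', 'nlp'],
    ['deep', 'learning', 'models', 'require', 'lots', 'of', 'data'],
    ['natural', 'language', 'processing', 'is', 'exciting'],
    ['fasttext', 'can', 'handle', 'rare', 'words'],
    ['subword', 'information', 'improves', 'embedding', 'quality'],
    ['more', 'data', 'helps', 'reduce', 'overfitting'],
    ['training', 'for', 'more', 'epochs', 'improves', 'results'],
    ['gensim', 'makes', 'training', 'fasttext', 'easy'],
]
word_pairs = [('machine', 'learning'), ('fasttext', 'embedding'),
              ('deep', 'model'), ('rare', 'words')]

# Vocabulary = every token that appears in the training sentences.
vocabulary = {token for sentence in sentences for token in sentence}

# Evaluation words that never occur as whole words in the training data.
eval_words = {word for pair in word_pairs for word in pair}
unseen = sorted(eval_words - vocabulary)
print(unseen)  # ['model']
```

Only 'model' is unseen ('models' is in the training data), so the pair ('deep', 'model') is the one that actually tests generalization to an unseen word.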
Results Interpretation

Before: Spearman correlation = 0.55 (poor generalization)

After: Spearman correlation = 0.72 (better generalization)

Adding training data, training for more epochs, and using subword information help FastText embeddings represent rare and unseen words better, which reduces overfitting and improves generalization.
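To make the metric concrete, here is a minimal sketch of how a Spearman correlation is computed from two score lists when there are no ties. It is a plain-Python stand-in for scipy.stats.spearmanr, and the `model_scores` values are hypothetical, not output from the model above.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


human_scores = [0.9, 0.8, 0.7, 0.6]
model_scores = [0.50, 0.40, 0.45, 0.10]  # hypothetical cosine similarities
rho = spearman(human_scores, model_scores)
print(f"Spearman correlation: {rho:.2f}")  # Spearman correlation: 0.80
```

With only four untied pairs, this formula can return only multiples of 0.2, so a value such as 0.55 or 0.72 would have to come from a larger word similarity benchmark than the four-pair toy set in the solution code.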
Bonus Experiment
Try training FastText embeddings with different n-gram ranges (e.g., min_n=2, max_n=5) and compare the effect on word similarity scores.
💡 Hint
Smaller n-grams capture smaller subword units but may add noise; experiment to find the best range.
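As a rough illustration of what changing the range does, the sketch below lists the character n-grams of one word for two ranges. It follows the boundary-marker convention (`<` and `>`) described for FastText, but it is a simplification: the helper `char_ngrams` is hypothetical, and gensim additionally hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n, max_n):
    """Return character n-grams of '<word>' with lengths min_n..max_n."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


for min_n, max_n in [(3, 6), (2, 5)]:
    grams = char_ngrams("where", min_n, max_n)
    print((min_n, max_n), len(grams), grams[:5])
```

For "where", the (3, 6) range yields 14 n-grams and (2, 5) yields 18. The extra ones are two-character fragments such as "wh" and "re", which are shared by many unrelated words and are one plausible source of the noise mentioned in the hint.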

Practice

1. What is the main advantage of FastText embeddings compared to traditional word embeddings?
easy
A. It considers subword information to handle rare or misspelled words.
B. It only works with whole words and ignores word parts.
C. It requires more memory because it stores entire sentences.
D. It uses images instead of text for embeddings.

Solution

  1. Step 1: Understand FastText's approach to word representation

    FastText breaks words into smaller parts called n-grams, which helps it learn better representations for rare or misspelled words.
  2. Step 2: Compare with traditional embeddings

    Traditional embeddings like Word2Vec treat each word as a single unit, so they have no vector for a word that never appeared in training and learn nothing shared between a word and its misspelling.
  3. Final Answer:

    It considers subword information to handle rare or misspelled words. -> Option A
  4. Quick Check:

    FastText uses subwords = A [OK]
Hint: Remember: FastText uses word parts, not just whole words [OK]
Common Mistakes:
  • Thinking FastText ignores subwords
  • Confusing FastText with image embeddings
  • Assuming FastText stores full sentences
2. Which of the following is the correct way to load pretrained FastText embeddings using the Gensim library in Python?
easy
A. model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin')
B. model = gensim.load('fasttext_model.bin')
C. model = gensim.models.Word2Vec.load('cc.en.300.bin')
D. model = gensim.models.fasttext.load_facebook_vectors('cc.en.300.bin')

Solution

  1. Step 1: Identify the correct Gensim function for FastText pretrained vectors

    Gensim 4 loads Facebook's native FastText .bin files with gensim.models.fasttext.load_facebook_vectors (or load_facebook_model when you need the full model to continue training). Both keep the subword n-gram vectors, so unseen words still get embeddings.
  2. Step 2: Check other options for correctness

    Option A calls FastText.load_fasttext_format, which was deprecated in Gensim 3.x and is no longer available in Gensim 4. Option C, Word2Vec.load, reads models saved with Gensim's own save() method, not Facebook's FastText binaries. Option B calls gensim.load, a function that does not exist.
  3. Final Answer:

    model = gensim.models.fasttext.load_facebook_vectors('cc.en.300.bin') -> Option D
  4. Quick Check:

    Use load_facebook_vectors for FastText .bin [OK]
Hint: Use gensim.models.fasttext.load_facebook_vectors for FastText .bin files [OK]
Common Mistakes:
  • Using Word2Vec.load for FastText files
  • Calling the removed load_fasttext_format method
  • Using KeyedVectors.load_word2vec_format on a FastText .bin file; that loader expects the word2vec format and keeps no subword information
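For Facebook's native FastText .bin files, Gensim 4 provides gensim.models.fasttext.load_facebook_vectors. The sketch below wraps that call in a small helper; the helper name `load_pretrained_fasttext` is hypothetical, and 'cc.en.300.bin' is only a placeholder path that must point to a file you have downloaded yourself.

```python
from pathlib import Path


def load_pretrained_fasttext(path):
    """Load vectors from a Facebook FastText .bin file (hypothetical helper)."""
    model_path = Path(path)
    if not model_path.is_file():
        raise FileNotFoundError(f"No FastText binary found at {model_path}")
    # Imported lazily so the path check runs before gensim is needed.
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(str(model_path))


try:
    vectors = load_pretrained_fasttext("cc.en.300.bin")  # placeholder path
    # Subword vectors are kept, so an unseen word still gets an embedding.
    print(vectors["unseenword"].shape)
except FileNotFoundError as error:
    print(error)
```

If you need to continue training rather than only look up vectors, load_facebook_model from the same module returns the full model instead.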
3. Given the following Python code using Gensim FastText model:
from gensim.models import FastText
sentences = [['cat', 'sat', 'on', 'mat'], ['dog', 'barked']]
model = FastText(sentences, vector_size=10, window=3, min_count=1, epochs=5)
print(model.wv['cat'])
What will be the output type of model.wv['cat']?
medium
A. A numpy array representing the vector embedding of 'cat'
B. An integer representing the frequency of 'cat'
C. A list of words similar to 'cat'
D. A string with the word 'cat'

Solution

  1. Step 1: Understand what model.wv['word'] returns in Gensim FastText

    model.wv['cat'] returns the embedding of 'cat' as a NumPy array of shape (10,), matching vector_size=10.
  2. Step 2: Check other options for output type

    Option C describes what model.wv.most_similar('cat') returns, not a vector lookup. Option B describes a frequency count, which indexing does not return. Option D is just the word itself, not its vector.
  3. Final Answer:

    A numpy array representing the vector embedding of 'cat' -> Option A
  4. Quick Check:

    model.wv['word'] returns vector array [OK]
Hint: model.wv['word'] gives vector array, not word list [OK]
Common Mistakes:
  • Expecting a list of similar words instead of vector
  • Thinking it returns frequency count
  • Confusing word string with vector
4. You trained a FastText model but get a KeyError when trying to get the vector for a word like 'unseenword'. What is the most likely cause and fix?
medium
A. The word is not in the training data; increase epochs to fix.
B. You used Word2Vec instead of FastText; switch to FastText to handle unseen words.
C. FastText cannot handle unseen words; use a different embedding method.
D. The model was not saved properly; reload the model correctly.

Solution

  1. Step 1: Understand FastText's ability with unseen words

    FastText can generate vectors for unseen words by using subword information, unlike Word2Vec.
  2. Step 2: Identify cause of KeyError

    If you get a KeyError for an unseen word, you most likely trained or loaded a Word2Vec model (or plain word vectors without subword information) rather than a FastText model.
  3. Final Answer:

    You used Word2Vec instead of FastText; switch to FastText to handle unseen words. -> Option B
  4. Quick Check:

    Use FastText (not Word2Vec) for unseen words [OK]
Hint: KeyError on unseen words means Word2Vec used, not FastText [OK]
Common Mistakes:
  • Assuming FastText can't handle unseen words
  • Trying to fix by increasing epochs only
  • Ignoring model type mismatch
5. You want to improve a text classification model's ability to understand misspelled words using FastText embeddings. Which approach is best?
hard
A. Use one-hot encoding instead of embeddings to avoid misspellings.
B. Use pretrained Word2Vec embeddings and ignore misspelled words during training.
C. Train FastText on your dataset with subword information enabled and use its vectors as input features.
D. Replace all misspelled words with a special token before training with any embeddings.

Solution

  1. Step 1: Identify how FastText handles misspelled words

    FastText uses subword (character n-gram) information, so it can create embeddings for misspelled or rare words.
  2. Step 2: Choose the best approach to leverage this feature

    Training FastText on your dataset with subword info enabled and using its vectors as features helps the model understand misspellings better.
  3. Final Answer:

    Train FastText on your dataset with subword information enabled and use its vectors as input features. -> Option C
  4. Quick Check:

    Train FastText with subwords for misspellings [OK]
Hint: Train FastText with subwords to handle misspellings [OK]
Common Mistakes:
  • Using Word2Vec ignoring misspellings
  • Replacing misspellings with tokens loses info
  • Using one-hot encoding loses semantic info
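The reason subword information helps with misspellings is that a misspelled word shares most of its character n-grams with the correct spelling. The sketch below measures that overlap with a Jaccard score. It is a simplified plain-Python illustration: `char_ngrams` and `ngram_overlap` are hypothetical helpers, and the example words are chosen only for demonstration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Set of character n-grams of '<word>' with lengths min_n..max_n."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard similarity between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


misspelled = ngram_overlap("embedding", "embeding")  # one letter dropped
unrelated = ngram_overlap("embedding", "quality")    # no shared n-grams
print(f"misspelled: {misspelled:.2f}, unrelated: {unrelated:.2f}")
```

Because FastText builds a word's vector from its n-gram vectors, the shared n-grams pull the misspelled form toward the correct one, whereas a whole-word model treats the two as unrelated tokens.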