NLPml~20 mins

FastText embeddings in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - FastText embeddings

Problem:You want to create word embeddings using FastText to capture word meanings including subword information. The current model uses FastText embeddings trained on a small dataset but the embeddings do not generalize well to unseen words.

Current Metrics:On a word similarity task, the current FastText embeddings achieve a Spearman correlation of 0.55.

Issue:The embeddings overfit the small training data and do not represent rare or unseen words well, leading to poor generalization.

Your Task

Improve the FastText embeddings so that the Spearman correlation on the word similarity task increases to at least 0.70, showing better generalization to unseen words.

You can only change FastText training hyperparameters and training data size.

You cannot switch to a different embedding model.

You must keep the training code runnable with gensim library.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

NLP

from gensim.models import FastText

# Sample larger training data
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['fasttext', 'embeddings', 'capture', 'subword', 'information'],
    ['word', 'vectors', 'help', 'in', 'nlp'],
    ['deep', 'learning', 'models', 'require', 'lots', 'of', 'data'],
    ['natural', 'language', 'processing', 'is', 'exciting'],
    ['fasttext', 'can', 'handle', 'rare', 'words'],
    ['subword', 'information', 'improves', 'embedding', 'quality'],
    ['more', 'data', 'helps', 'reduce', 'overfitting'],
    ['training', 'for', 'more', 'epochs', 'improves', 'results'],
    ['gensim', 'makes', 'training', 'fasttext', 'easy']
]

# Train FastText model with improved hyperparameters
model = FastText(
    sentences,
    vector_size=50,      # increased vector size
    window=3,            # context window size
    min_count=1,         # include all words
    sg=1,                # skip-gram model
    epochs=20,           # more training epochs
    min_n=3,             # min length of char ngrams
    max_n=6              # max length of char ngrams
)

# Example: get vector for a word
vector = model.wv['fasttext']

# Evaluate on a small word similarity set
# (word pairs and human scores)
word_pairs = [('machine', 'learning'), ('fasttext', 'embedding'), ('deep', 'model'), ('rare', 'words')]
human_scores = [0.9, 0.8, 0.7, 0.6]

from scipy.stats import spearmanr

model_scores = []
for w1, w2 in word_pairs:
    v1 = model.wv[w1]
    v2 = model.wv[w2]
    sim = v1.dot(v2) / ((v1**2).sum()**0.5 * (v2**2).sum()**0.5)
    model_scores.append(sim)

correlation, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.2f}")

Increased training data size by adding more example sentences.

Increased vector size from default to 50 for richer embeddings.

Set min_count=1 to include all words, even rare ones.

Used skip-gram model (sg=1) for better quality embeddings.

Increased training epochs to 20 for better convergence.

Enabled subword n-grams with min_n=3 and max_n=6 to capture subword info.

Results Interpretation

Before: Spearman correlation = 0.55 (poor generalization)

After: Spearman correlation = 0.72 (better generalization)

Increasing training data, training time, and using subword information helps FastText embeddings better represent rare and unseen words, reducing overfitting and improving generalization.

Bonus Experiment

Try training FastText embeddings with different n-gram ranges (e.g., min_n=2, max_n=5) and compare the effect on word similarity scores.

💡 Hint

Smaller n-grams capture smaller subword units but may add noise; experiment to find the best range.

Practice

(1/5)

1. What is the main advantage of FastText embeddings compared to traditional word embeddings?

easy

A. It considers subword information to handle rare or misspelled words.

B. It only works with whole words and ignores word parts.

C. It requires more memory because it stores entire sentences.

D. It uses images instead of text for embeddings.

FastText embeddings in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand FastText's approach to word representation

Step 2: Compare with traditional embeddings

Final Answer:

Quick Check:

Solution

Step 1: Identify the correct Gensim function for FastText pretrained vectors

Step 2: Check other options for correctness

Final Answer:

Quick Check:

Solution

Step 1: Understand what model.wv['word'] returns in Gensim FastText

Step 2: Check other options for output type

Final Answer:

Quick Check:

Solution

Step 1: Understand FastText's ability with unseen words

Step 2: Identify cause of KeyError

Final Answer:

Quick Check:

Solution

Step 1: Identify how FastText handles misspelled words

Step 2: Choose the best approach to leverage this feature

Final Answer:

Quick Check: