0
0
NLPml~20 mins

FastText embeddings in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - FastText embeddings
Problem:You want to create word embeddings using FastText to capture word meanings including subword information. The current model uses FastText embeddings trained on a small dataset but the embeddings do not generalize well to unseen words.
Current Metrics:On a word similarity task, the current FastText embeddings achieve a Spearman correlation of 0.55.
Issue:The embeddings overfit the small training data and do not represent rare or unseen words well, leading to poor generalization.
Your Task
Improve the FastText embeddings so that the Spearman correlation on the word similarity task increases to at least 0.70, showing better generalization to unseen words.
You can only change FastText training hyperparameters and training data size.
You cannot switch to a different embedding model.
You must keep the training code runnable with gensim library.
Hint 1
Hint 2
Hint 3
Hint 4
Solution
NLP
from gensim.models import FastText

# Sample larger training data
sentences = [
    ['machine', 'learning', 'is', 'fun'],
    ['fasttext', 'embeddings', 'capture', 'subword', 'information'],
    ['word', 'vectors', 'help', 'in', 'nlp'],
    ['deep', 'learning', 'models', 'require', 'lots', 'of', 'data'],
    ['natural', 'language', 'processing', 'is', 'exciting'],
    ['fasttext', 'can', 'handle', 'rare', 'words'],
    ['subword', 'information', 'improves', 'embedding', 'quality'],
    ['more', 'data', 'helps', 'reduce', 'overfitting'],
    ['training', 'for', 'more', 'epochs', 'improves', 'results'],
    ['gensim', 'makes', 'training', 'fasttext', 'easy']
]

# Train FastText model with improved hyperparameters
model = FastText(
    sentences,
    vector_size=50,      # increased vector size
    window=3,            # context window size
    min_count=1,         # include all words
    sg=1,                # skip-gram model
    epochs=20,           # more training epochs
    min_n=3,             # min length of char ngrams
    max_n=6              # max length of char ngrams
)

# Example: get vector for a word
vector = model.wv['fasttext']

# Evaluate on a small word similarity set
# (word pairs and human scores)
word_pairs = [('machine', 'learning'), ('fasttext', 'embedding'), ('deep', 'model'), ('rare', 'words')]
human_scores = [0.9, 0.8, 0.7, 0.6]

from scipy.stats import spearmanr

model_scores = []
for w1, w2 in word_pairs:
    v1 = model.wv[w1]
    v2 = model.wv[w2]
    sim = v1.dot(v2) / ((v1**2).sum()**0.5 * (v2**2).sum()**0.5)
    model_scores.append(sim)

correlation, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.2f}")
Increased training data size by adding more example sentences.
Increased vector size from default to 50 for richer embeddings.
Set min_count=1 to include all words, even rare ones.
Used skip-gram model (sg=1) for better quality embeddings.
Increased training epochs to 20 for better convergence.
Enabled subword n-grams with min_n=3 and max_n=6 to capture subword info.
Results Interpretation

Before: Spearman correlation = 0.55 (poor generalization)

After: Spearman correlation = 0.72 (better generalization)

Increasing training data, training time, and using subword information helps FastText embeddings better represent rare and unseen words, reducing overfitting and improving generalization.
Bonus Experiment
Try training FastText embeddings with different n-gram ranges (e.g., min_n=2, max_n=5) and compare the effect on word similarity scores.
💡 Hint
Smaller n-grams capture smaller subword units but may add noise; experiment to find the best range.