
N-grams in NLP - ML Experiment: Train & Evaluate

Experiment - N-grams
Problem: You want to build a simple text classifier that uses N-grams to represent text data. Currently, the model uses only unigrams (single words) as features.
Current Metrics: Training accuracy: 85%, Validation accuracy: 70%
Issue: The model underfits because it only uses unigrams, missing important word combinations that could improve understanding.
Your Task
Improve validation accuracy to at least 78% by using bigrams (pairs of words) along with unigrams as features.
Keep the same classifier (Logistic Regression).
Do not change the dataset or model hyperparameters except the feature extraction method.
Solution
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
categories = ['rec.sport.baseball', 'sci.med']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split data
X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Feature extraction with unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train logistic regression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_vec, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_vec)
val_preds = model.predict(X_val_vec)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
Changed CountVectorizer to use ngram_range=(1, 2) to include unigrams and bigrams.
Kept the same Logistic Regression model and dataset splits.
Re-trained and evaluated the model with new features.
Results Interpretation

Before: Training accuracy: 85%, Validation accuracy: 70%

After: Training accuracy: 90%, Validation accuracy: 79%

Adding bigrams helps the model capture word pairs that carry more meaning than single words alone. This improves the model's ability to understand text and increases validation accuracy, reducing underfitting.
Bonus Experiment
Try adding trigrams (three-word sequences) along with unigrams and bigrams to see if the validation accuracy improves further.
💡 Hint
Set ngram_range=(1, 3) in CountVectorizer and observe if the model overfits or improves.