N-grams in NLP - ML Experiment: Train & Evaluate

Experiment - N-grams

Problem: You want to build a simple text classifier that uses N-grams to represent text data. Currently, the model uses only unigrams (single words) as features.

Current Metrics: Training accuracy: 85%, Validation accuracy: 70%

Issue: The model underfits because it only uses unigrams, missing important word combinations that could improve understanding.

Your Task

Improve validation accuracy to at least 78% by using bigrams (pairs of words) along with unigrams as features.

Keep the same classifier (Logistic Regression).

Do not change the dataset or model hyperparameters except the feature extraction method.

Solution

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
categories = ['rec.sport.baseball', 'sci.med']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split data
X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Feature extraction with unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train logistic regression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_vec, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_vec)
val_preds = model.predict(X_val_vec)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")

Changed CountVectorizer to use ngram_range=(1, 2) to include unigrams and bigrams.

Kept the same Logistic Regression model and dataset splits.

Re-trained and evaluated the model with new features.
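To see the effect of the feature change in isolation, the same classifier can be fitted once per `ngram_range` and the accuracies collected side by side. The sketch below is one way to do that; the helper name `compare_ngram_ranges` and the toy sentences are assumptions made for illustration and are not part of the original experiment, so the numbers it prints say nothing about the 20 Newsgroups results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def compare_ngram_ranges(X_train, y_train, X_val, y_val, ranges=((1, 1), (1, 2))):
    """Hypothetical helper: fit the same classifier once per ngram_range."""
    results = {}
    for ngram_range in ranges:
        # Only the feature extraction changes; the classifier settings stay fixed.
        pipeline = make_pipeline(
            CountVectorizer(ngram_range=ngram_range, stop_words='english'),
            LogisticRegression(max_iter=1000, random_state=42),
        )
        pipeline.fit(X_train, y_train)
        results[ngram_range] = {
            'train': accuracy_score(y_train, pipeline.predict(X_train)),
            'val': accuracy_score(y_val, pipeline.predict(X_val)),
        }
    return results


# Toy data invented for illustration (0 = baseball, 1 = medicine).
toy_train = [
    "the pitcher threw a fast ball",
    "home run in the ninth inning",
    "the doctor prescribed a new medicine",
    "patients reported mild symptoms after treatment",
]
toy_train_labels = [0, 0, 1, 1]
toy_val = ["the pitcher threw the ball", "the doctor examined the patients"]
toy_val_labels = [0, 1]

results = compare_ngram_ranges(toy_train, toy_train_labels, toy_val, toy_val_labels)
for ngram_range, scores in results.items():
    print(ngram_range, scores)
```

Passing the real `X_train`, `y_train`, `X_val`, and `y_val` from the solution in place of the toy lists would give the unigram baseline and the unigram-plus-bigram result from a single call.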

Results Interpretation

Before: Training accuracy: 85%, Validation accuracy: 70%

After: Training accuracy: 90%, Validation accuracy: 79%

Adding bigrams lets the model capture word pairs whose meaning differs from that of the individual words, such as "home run". These extra features give the classifier more signal than unigram counts alone, which raises validation accuracy and reduces underfitting.
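A small sketch can make the difference between the two feature sets concrete. The two example sentences are invented for illustration. Note one assumption that differs from the solution: no stop-word list is used here, because `CountVectorizer` filters stop words before it builds n-grams, and "not" is on scikit-learn's English stop-word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented sentences that differ only by a negation.
docs = ["the movie was not good", "the movie was good"]

unigram_vec = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# vocabulary_ maps each learned feature string to its column index.
unigram_features = sorted(unigram_vec.vocabulary_)
bigram_features = sorted(bigram_vec.vocabulary_)

print("Unigrams only:      ", unigram_features)
print("Unigrams + bigrams: ", bigram_features)
```

With unigrams only, both sentences share the feature "good" and differ just by the isolated token "not". With bigrams included, "not good" and "was good" become separate features that a linear classifier can weight differently.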

Bonus Experiment

Try adding trigrams (three-word sequences) along with unigrams and bigrams to see if the validation accuracy improves further.

💡 Hint

Set ngram_range=(1, 3) in CountVectorizer and observe if the model overfits or improves.
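Before running the trigram experiment, it can help to check how quickly the feature space grows, since a much larger and sparser vocabulary is one reason a model may start to overfit. The following is a minimal sketch on two invented sentences; the variable names are illustrative, and the sizes on the real dataset will be far larger.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences, used only to illustrate vocabulary growth.
docs = [
    "the pitcher threw a fast ball in the ninth inning",
    "the doctor prescribed a new medicine for the patients",
]

vocab_sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(docs)
    # Each distinct n-gram becomes one column in the feature matrix.
    vocab_sizes[ngram_range] = len(vectorizer.vocabulary_)

for ngram_range, size in vocab_sizes.items():
    print(f"ngram_range={ngram_range}: {size} features")
```

If validation accuracy stalls or drops while training accuracy keeps rising as the range widens, that pattern would suggest the extra trigram features are being memorized rather than generalizing.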

Practice

1. What is an n-gram in natural language processing?

easy

A. A random selection of n words from a text

B. A single word repeated n times

C. A sentence with n words

D. A group of n consecutive words in a text

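Solution

Step 1: Understand the definition of n-gram

An n-gram is a contiguous sequence of n items, here words, taken from a text.

Step 2: Compare options with definition

Option A drops the requirement that the words be adjacent, option B describes one word repeated, and option C describes a whole sentence. Only option D describes n consecutive words.

Final Answer: D. A group of n consecutive words in a text

Quick Check: In "I love NLP", the bigrams are "I love" and "love NLP".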
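The definition can be checked with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization; the function name `word_ngrams` is made up for illustration, and `CountVectorizer` tokenizes differently (it lowercases and drops single-character tokens by default).

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated words in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("I love machine learning", 2))
# ['I love', 'love machine', 'machine learning']
```

A text with fewer than n words yields an empty list, because no window of that width fits.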
