Python NLP ecosystem (NLTK, spaCy, Hugging Face) - ML Experiment: Train & Evaluate

Experiment - Python NLP ecosystem (NLTK, spaCy, Hugging Face)
Problem: You want to build a text classification model using Python NLP tools. The current pipeline uses NLTK for preprocessing and a simple logistic regression model. It reaches 85% training accuracy but only 70% validation accuracy.
Current Metrics: Training accuracy: 85%, Validation accuracy: 70%, Training loss: 0.35, Validation loss: 0.65
Issue: The model overfits the training data. The 15-point accuracy gap and the higher validation loss (0.65 vs. 0.35) show that it does not generalize well to new data.
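The gap described above can be computed directly from the quoted metrics. The following is a minimal sketch; the helper name `generalization_gap` and the 10-point threshold are assumptions chosen for illustration, not part of the exercise.

```python
# Metrics quoted in the problem statement
train_acc, val_acc = 0.85, 0.70
train_loss, val_loss = 0.35, 0.65


def generalization_gap(train_metric: float, val_metric: float) -> float:
    """Return the absolute difference between a training and a validation metric."""
    return abs(train_metric - val_metric)


acc_gap = generalization_gap(train_acc, val_acc)
loss_gap = generalization_gap(train_loss, val_loss)

# Hypothetical rule of thumb: flag accuracy gaps above 10 points
OVERFIT_THRESHOLD = 0.10
is_overfitting = acc_gap > OVERFIT_THRESHOLD

print(f"Accuracy gap: {acc_gap:.2f}, loss gap: {loss_gap:.2f}, overfitting: {is_overfitting}")
```

Tracking this gap after each change makes it easier to tell whether a fix reduces overfitting or merely raises both numbers.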
Your Task
Reduce overfitting and improve validation accuracy to at least 80% while keeping training accuracy below 90%.
You must use Python NLP libraries: NLTK, spaCy, and Hugging Face transformers.
You cannot increase the training data size.
You should keep the model architecture simple.
Solution
import spacy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn as nn
import torch.optim as optim

# Load spaCy model for preprocessing
nlp = spacy.load('en_core_web_sm')

# Sample data (replace with real dataset)
texts = ["I love this movie", "This film is terrible", "Amazing story and great acting", "Worst movie ever"]
labels = [1, 0, 1, 0]

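# Preprocess texts with spaCy: lowercase, lemmatize, drop stop words and non-alphabetic tokens
def preprocess(text):
    doc = nlp(text.lower())
    return " ".join(token.lemma_ for token in doc if not token.is_stop and token.is_alpha)

processed_texts = [preprocess(text) for text in texts]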

# Use Hugging Face tokenizer and model for feature extraction
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')

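class TextClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = model
        # Freeze DistilBERT so it acts as a fixed feature extractor;
        # only the linear head below is trained
        for param in self.bert.parameters():
            param.requires_grad = False
        self.dropout = nn.Dropout(0.3)
        self.linear = nn.Linear(768, 2)  # 768 = DistilBERT hidden size, 2 = number of classes

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the hidden state of the first ([CLS]) token as the sentence representation
        pooled_output = outputs.last_hidden_state[:, 0, :]
        dropped = self.dropout(pooled_output)
        return self.linear(dropped)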

# Prepare data for PyTorch
inputs = tokenizer(processed_texts, padding=True, truncation=True, return_tensors='pt')
labels_tensor = torch.tensor(labels)

# Split data
train_indices, val_indices = train_test_split(range(len(labels)), test_size=0.5, random_state=42)

train_inputs = {k: v[train_indices] for k, v in inputs.items()}
val_inputs = {k: v[val_indices] for k, v in inputs.items()}
train_labels = labels_tensor[train_indices]
val_labels = labels_tensor[val_indices]

# Initialize model, loss, optimizer
classifier = TextClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(classifier.parameters(), lr=1e-4)

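# Training loop with early stopping
best_val_acc = 0
best_state = None  # weights from the epoch with the best validation accuracy
patience = 3
trigger_times = 0

for epoch in range(20):
    # One full-batch gradient step per epoch (the toy dataset fits in a single batch)
    classifier.train()
    optimizer.zero_grad()
    outputs = classifier(train_inputs['input_ids'], train_inputs['attention_mask'])
    loss = criterion(outputs, train_labels)
    loss.backward()
    optimizer.step()

    # Evaluate on the validation split with dropout disabled
    classifier.eval()
    with torch.no_grad():
        val_outputs = classifier(val_inputs['input_ids'], val_inputs['attention_mask'])
        val_loss = criterion(val_outputs, val_labels)
        val_preds = val_outputs.argmax(dim=1)
        val_acc = accuracy_score(val_labels.numpy(), val_preds.numpy())

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state = {k: v.clone() for k, v in classifier.state_dict().items()}
        trigger_times = 0
    else:
        trigger_times += 1
        if trigger_times >= patience:
            break  # no improvement for `patience` consecutive epochs

# Restore the best checkpoint so both reported accuracies describe the same model
if best_state is not None:
    classifier.load_state_dict(best_state)

classifier.eval()
with torch.no_grad():
    train_preds = classifier(train_inputs['input_ids'], train_inputs['attention_mask']).argmax(dim=1)
train_acc = accuracy_score(train_labels.numpy(), train_preds.numpy())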

print(f'Training accuracy: {train_acc*100:.2f}%')
print(f'Validation accuracy: {best_val_acc*100:.2f}%')
Replaced NLTK preprocessing with spaCy lemmatization and stop word removal for cleaner text.
Used Hugging Face's pretrained DistilBERT model to extract better text features.
Built a simple neural network classifier on top of DistilBERT embeddings.
Added dropout layer to reduce overfitting.
Implemented early stopping to prevent training too long and overfitting.
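The early-stopping rule listed above can be isolated from the training code. This is a small sketch under stated assumptions: the function name `early_stopping_epoch` is hypothetical and the validation curve is made up for illustration. It applies the same patience logic to a plain list of per-epoch validation accuracies.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return (stop_epoch, best_epoch, best_acc) for per-epoch validation accuracies.

    Training stops once `patience` consecutive epochs fail to beat the best
    accuracy seen so far; ties count as no improvement.
    """
    best_acc = 0.0
    best_epoch = None
    bad_epochs = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch, best_acc
    # Patience was never exhausted: training ran through the last epoch
    return len(val_accuracies) - 1, best_epoch, best_acc


# Made-up validation curve, for illustration only
history = [0.62, 0.70, 0.78, 0.77, 0.78, 0.76, 0.75]
stop_epoch, best_epoch, best_acc = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch, best_acc)  # 5 2 0.78
```

Because the stopping epoch (5) comes later than the best epoch (2), it is usually worth keeping a copy of the weights from the best epoch rather than using the final ones.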
Results Interpretation

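Before: Training accuracy 85%, validation accuracy 70%. The 15-point gap indicates strong overfitting.

After: Training accuracy 88%, validation accuracy 82%. The gap shrinks to 6 points, so the model generalizes better. These figures assume a real dataset. The four-sentence sample in the code leaves only two validation examples, so it can only report 0%, 50%, or 100%.

Better preprocessing and pretrained models from the Python NLP ecosystem reduce overfitting and improve validation accuracy: DistilBERT supplies richer text representations, while dropout and early stopping act as regularizers.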
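To make the regularization part concrete, here is a pure-Python sketch of inverted dropout, the scheme `nn.Dropout` follows: during training each value is zeroed with probability `p` and the survivors are scaled by `1 / (1 - p)`, while evaluation leaves the input unchanged. The function name and sample values are illustrative assumptions.

```python
import random


def dropout(values, p=0.3, training=True, rng=None):
    """Apply inverted dropout to a list of floats."""
    if not training or p == 0.0:
        # At evaluation time dropout is the identity function
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    # Keep each value with probability 1 - p and rescale it; otherwise zero it
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]


features = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(features, p=0.3, training=True, rng=random.Random(0))
eval_out = dropout(features, p=0.3, training=False)

print(train_out)
print(eval_out)
```

The rescaling keeps the expected value of each feature the same in training and evaluation, which is why no correction is needed when `classifier.eval()` switches dropout off.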
Bonus Experiment
Try fine-tuning the entire DistilBERT model instead of just training a classifier on top.
💡 Hint
Lower the learning rate and train for fewer epochs to avoid overfitting when fine-tuning large models.

Practice

1. Which Python library is best known for providing pre-trained models for advanced NLP tasks?
easy
A. NLTK
B. Hugging Face
C. spaCy
D. Scikit-learn

Solution

  1. Step 1: Understand the role of each library

    NLTK is mainly for learning and basic NLP tasks, spaCy is for fast real-world processing, and Hugging Face offers powerful pre-trained models.
  2. Step 2: Identify the library specialized in pre-trained models

    Hugging Face is known for its large collection of pre-trained transformer models for advanced NLP.
  3. Final Answer:

    Hugging Face -> Option B
  4. Quick Check:

    Pre-trained models = Hugging Face [OK]
Hint: Remember: Hugging Face = pre-trained models [OK]
Common Mistakes:
  • Confusing NLTK as the source of pre-trained models
  • Thinking spaCy provides many pre-trained transformer models
  • Choosing Scikit-learn which is not specialized for NLP
2. Which of the following is the correct way to import the English language model in spaCy?
easy
A. import spacy; nlp = spacy.load('en_core_web_sm')
B. import spacy; nlp = spacy.load('english')
C. from spacy import English; nlp = English()
D. import spacy; nlp = spacy.load('en')

Solution

  1. Step 1: Recall spaCy's model loading syntax

    spaCy loads models using spacy.load() with the model name like 'en_core_web_sm'.
  2. Step 2: Check each option's syntax

    import spacy; nlp = spacy.load('en_core_web_sm') uses the correct package name for the small English pipeline. 'en' was a shortcut name in spaCy v2 and is no longer accepted by spacy.load() in spaCy v3, and 'english' is not a valid model name. from spacy import English fails because the class lives in spacy.lang.en; even when imported correctly, English() provides only a tokenizer without trained components.
  3. Final Answer:

    import spacy; nlp = spacy.load('en_core_web_sm') -> Option A
  4. Quick Check:

    spaCy model load = spacy.load('en_core_web_sm') [OK]
Hint: Use spacy.load('en_core_web_sm') to load English model [OK]
Common Mistakes:
  • Using 'english' or 'en' instead of 'en_core_web_sm'
  • Trying to import English class instead of loading model
  • Forgetting to install the model before loading
3. What will be the output of this NLTK code snippet?
import nltk
from nltk.tokenize import word_tokenize
text = "Hello world!"
tokens = word_tokenize(text)
print(tokens)
medium
A. ['Hello world!']
B. ['Hello', 'world']
C. ['Hello', 'world!']
D. ['Hello', 'world', '!']

Solution

  1. Step 1: Understand word_tokenize behavior

    NLTK's word_tokenize splits text into words and punctuation separately.
  2. Step 2: Apply tokenization to 'Hello world!'

    The text splits into three tokens: 'Hello', 'world', and '!'.
  3. Final Answer:

    ['Hello', 'world', '!'] -> Option D
  4. Quick Check:

    word_tokenize splits punctuation separately [OK]
Hint: word_tokenize splits punctuation as separate tokens [OK]
Common Mistakes:
  • Expecting punctuation to stay attached to words
  • Confusing tokenization with simple split()
  • Ignoring that '!' is a separate token
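The difference between tokenization and a plain split() can be seen without NLTK installed. The regular expression below is only a rough stand-in for punctuation-aware tokenization, not NLTK's actual algorithm, and is an assumption made for illustration.

```python
import re

text = "Hello world!"

# Whitespace splitting keeps punctuation attached to the word
whitespace_tokens = text.split()

# Rough approximation: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Hello', 'world!']
print(regex_tokens)       # ['Hello', 'world', '!']
```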
4. Identify the error in this Hugging Face transformers code snippet:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('I love NLP!')
print(result[0])
medium
A. Missing model download before pipeline creation
B. Incorrect pipeline task name
C. No error, code runs correctly
D. Result indexing should be result[1]

Solution

  1. Step 1: Check pipeline usage

    The pipeline function with 'sentiment-analysis' is correct and downloads the default model automatically if needed.
  2. Step 2: Verify result usage

    The classifier returns a list of dicts; accessing result[0] is correct to get the first prediction.
  3. Final Answer:

    No error, code runs correctly -> Option C
  4. Quick Check:

    Hugging Face pipeline auto-downloads models [OK]
Hint: Hugging Face pipelines auto-download models [OK]
Common Mistakes:
  • Thinking model must be downloaded manually first
  • Using wrong pipeline task name
  • Accessing wrong index of result list
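The indexing point can be checked with a stand-in for the pipeline output. The list below only mimics the list-of-dicts shape that a sentiment pipeline returns for a single input string. Its label and score are invented placeholders, not real model output.

```python
# Placeholder with the same shape as a single-input sentiment result;
# the label and score values are invented for illustration
result = [{"label": "POSITIVE", "score": 0.99}]

# One input string yields one prediction, stored at index 0
first = result[0]
summary = f"{first['label']} ({first['score']:.2f})"
print(summary)

# Index 1 does not exist for a single input, so result[1] would fail
try:
    result[1]
    second_exists = True
except IndexError:
    second_exists = False
print(second_exists)
```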
5. You want to extract named entities from a text quickly and accurately. Which combination of tools and steps is best?
hard
A. Use spaCy's pre-trained model with nlp = spacy.load('en_core_web_sm') and then nlp(text).ents
B. Use NLTK's word_tokenize and then manually match entity patterns
C. Use Hugging Face pipeline('ner') without loading any model
D. Use spaCy's tokenizer only and ignore entity recognition

Solution

  1. Step 1: Identify fast and accurate named entity extraction

    spaCy provides pre-trained models that include named entity recognition (NER) ready to use.
  2. Step 2: Evaluate options for NER

    Option A loads spaCy's pre-trained pipeline with nlp = spacy.load('en_core_web_sm') and reads entities from nlp(text).ents, which is fast and needs no extra setup. Option B relies on NLTK's word_tokenize plus manual pattern matching, which is slow to build and error-prone. Option C does run, because pipeline('ner') falls back to a default model when none is named, but that default is a much larger transformer that is slower to download and run, and relying on an implicit default is discouraged. Option D uses only spaCy's tokenizer and never performs entity recognition.
  3. Final Answer:

    Use spaCy's pre-trained model with nlp = spacy.load('en_core_web_sm') and then nlp(text).ents -> Option A
  4. Quick Check:

    spaCy pre-trained models = fast NER [OK]
Hint: spaCy pre-trained models provide fast named entity recognition [OK]
Common Mistakes:
  • Trying to do NER manually with NLTK tokens
  • Relying on pipeline('ner') and its implicit default model when speed matters
  • Ignoring entity extraction step