NLP · ~20 mins

Why text classification categorizes documents in NLP - Experiment to Prove It

Experiment - Why text classification categorizes documents
Problem: We want to teach a computer to read short text messages and decide which category they belong to, such as 'sports', 'technology', or 'health'. Currently, the model guesses correctly 95% of the time on the training messages but only 70% of the time on new messages it has not seen before.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%
Issue: The model is overfitting. It learns the training messages too well but does not generalize to new messages.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.
You can only change the model architecture and training settings.
Do not add more data or change the dataset.
Keep the text preprocessing steps the same.
Solution
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Assume X_train, y_train, X_val, y_val are preprocessed and ready

vocab_size = 10000
embedding_dim = 16
max_length = 100

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    GlobalAveragePooling1D(),
    Dropout(0.5),
    Dense(16, activation='relu'),
    Dropout(0.5),
    Dense(3, activation='softmax')  # 3 categories
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    epochs=30,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stop]
)
Added Dropout layers with 50% rate after embedding and dense layers to reduce overfitting.
Kept the dense layer small (16 neurons) to limit model capacity and simplify the model.
Added EarlyStopping callback to stop training when validation loss stops improving.
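To see what dropout actually does, here is a minimal NumPy sketch of "inverted dropout", the variant Keras uses: during training, each activation is zeroed with probability `rate`, and the survivors are scaled up by `1 / (1 - rate)` so the expected activation magnitude is unchanged. The function name and test values are illustrative, not from the lesson code.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero a fraction `rate` of activations, rescale the rest."""
    if not training:
        return x  # at inference time dropout is a no-op
    mask = rng.random(x.shape) >= rate        # keep each unit with prob 1 - rate
    return x * mask / (1.0 - rate)            # rescale so the expectation is unchanged

x = np.ones(8)
print(dropout(x))  # roughly half the entries are 0.0, the survivors are 2.0
```

Because each forward pass sees a different random subnetwork, no single neuron can be relied on to memorize a training example, which is why dropout reduces overfitting.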
Results Interpretation

Before: Training accuracy was 95%, validation accuracy was 70%, showing overfitting.

After: Training accuracy dropped to 90%, validation accuracy improved to 87%, showing better generalization.

Dropout and early stopping keep the model from memorizing the training data, so it performs better on new, unseen texts.
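A quick way to check generalization after training is to compare the final training and validation accuracy from the `history` object returned by `model.fit`. The dictionary below uses hypothetical values for illustration; in the real experiment you would pass `history.history` instead.

```python
# Hypothetical metrics, for illustration; real values come from history.history
history_metrics = {
    "accuracy":     [0.70, 0.82, 0.90],
    "val_accuracy": [0.68, 0.80, 0.87],
}

def generalization_gap(metrics):
    """Difference between final training and validation accuracy.

    A small gap (a few percentage points) suggests the model generalizes;
    a large gap, like the original 95% vs 70%, indicates overfitting.
    """
    return metrics["accuracy"][-1] - metrics["val_accuracy"][-1]

print(f"gap = {generalization_gap(history_metrics):.2f}")  # gap = 0.03
```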
Bonus Experiment
Try using a simpler model like logistic regression with TF-IDF features instead of a neural network.
💡 Hint
Use scikit-learn's TfidfVectorizer and LogisticRegression to see if a simpler model can also reduce overfitting.
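A minimal sketch of that baseline, using a tiny made-up dataset (the real experiment would reuse the same train/validation split as the neural network). `TfidfVectorizer` and `LogisticRegression` are real scikit-learn classes; the example texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset with the lesson's three categories
texts = [
    "the team won the match last night",
    "new smartphone chip released today",
    "doctors recommend regular exercise",
    "the striker scored two goals",
    "software update fixes security bug",
    "a balanced diet lowers blood pressure",
]
labels = ["sports", "technology", "health", "sports", "technology", "health"]

clf = make_pipeline(
    TfidfVectorizer(),                         # bag of words weighted by TF-IDF
    LogisticRegression(C=1.0, max_iter=1000),  # smaller C = stronger regularization
)
clf.fit(texts, labels)

print(clf.predict(["the striker scored two goals"]))
```

Linear models like this have far fewer parameters than a neural network, so they often overfit less on small text datasets; tuning `C` plays a role similar to dropout in the Keras model.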