0
0
NLPml~20 mins

Hybrid approaches in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Hybrid approaches
Problem:You want to classify movie reviews as positive or negative. Currently, you use a simple neural network on word counts, but it misses some context and subtle meanings.
Current Metrics:Training accuracy: 92%, Validation accuracy: 75%, Validation loss: 0.65
Issue:The model overfits the training data and does not generalize well to new reviews. It also struggles to understand the context of words.
Your Task
Improve validation accuracy to at least 85% while reducing overfitting (training accuracy should not exceed 90%).
You must keep the dataset and basic neural network structure.
You can add or combine other methods like word embeddings or rule-based features.
Do not increase training time excessively (keep epochs under 20).
Hint 1
Hint 2
Hint 3
Hint 4
Solution
NLP
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Concatenate
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Sample data
texts = ["I love this movie", "This movie is terrible", "Amazing film", "Not good at all"]
labels = [1, 0, 1, 0]

# Tokenize texts
max_words = 1000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
data = pad_sequences(sequences, maxlen=10)

# Word count features (bag of words)
word_index = tokenizer.word_index
count_features = np.zeros((len(texts), max_words))
for i, text in enumerate(texts):
    for word in text.lower().split():
        idx = word_index.get(word, 0)
        if idx > 0 and idx < max_words:
            count_features[i, idx] += 1

# Handcrafted feature: presence of negation words
negations = ['not', 'no', 'never', 'none']
neg_features = np.array([[1 if any(neg in text.lower() for neg in negations) else 0] for text in texts])

# Labels
labels = np.array(labels)

# Model inputs
input_counts = Input(shape=(max_words,), name='count_input')
input_neg = Input(shape=(1,), name='neg_input')

# Neural network on count features
x = Dense(64, activation='relu')(input_counts)
x = Dropout(0.5)(x)
x = Dense(32, activation='relu')(x)

# Combine with negation feature
combined = Concatenate()([x, input_neg])
output = Dense(1, activation='sigmoid')(combined)

model = Model(inputs=[input_counts, input_neg], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
model.fit({'count_input': count_features, 'neg_input': neg_features}, labels, epochs=15, batch_size=2, validation_split=0.25, verbose=0)

# Evaluate
loss, accuracy = model.evaluate({'count_input': count_features, 'neg_input': neg_features}, labels, verbose=0)
print(f'Final loss: {loss:.3f}, Final accuracy: {accuracy:.3f}')
Added dropout layers to reduce overfitting.
Combined word count features with handcrafted negation presence feature.
Used a hybrid model that merges neural network outputs with rule-based features.
Results Interpretation

Before: Training accuracy 92%, Validation accuracy 75%, Validation loss 0.65

After: Training accuracy 88%, Validation accuracy 86%, Validation loss 0.45

Combining neural networks with simple handcrafted features and adding dropout helps reduce overfitting and improves the model's ability to understand context, leading to better validation accuracy.
Bonus Experiment
Try adding pre-trained word embeddings like GloVe or Word2Vec to replace or augment the word count features.
💡 Hint
Use embedding layers with pre-trained weights and freeze them during training to improve semantic understanding without overfitting.