Experiment - Why translation breaks language barriers

Problem:We want to build a simple machine translation model that translates English sentences into French. Currently, the model translates well on training data but performs poorly on new sentences, showing low accuracy.

Current Metrics:Training accuracy: 95%, Validation accuracy: 60%, Validation loss: 1.2

Issue:The model is overfitting the training data and does not generalize well to new sentences, causing poor translation quality on unseen data.

Your Task

Reduce overfitting so that validation accuracy improves to at least 80% while keeping training accuracy below 90%.

You cannot increase the size of the training dataset.

You must keep the model architecture simple (no adding layers).

You can adjust hyperparameters and add regularization techniques.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

NLP

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

# Sample data (toy example)
english_sentences = ['hello', 'how are you', 'good morning', 'thank you', 'see you']
french_sentences = ['bonjour', 'comment ça va', 'bon matin', 'merci', 'à bientôt']

# Tokenize English
eng_tokenizer = Tokenizer()
eng_tokenizer.fit_on_texts(english_sentences)
eng_seq = eng_tokenizer.texts_to_sequences(english_sentences)
eng_seq = pad_sequences(eng_seq, padding='post')

# Tokenize French
fr_tokenizer = Tokenizer()
fr_tokenizer.fit_on_texts(french_sentences)
fr_seq = fr_tokenizer.texts_to_sequences(french_sentences)
fr_seq = pad_sequences(fr_seq, padding='post')

vocab_eng = len(eng_tokenizer.word_index) + 1
vocab_fr = len(fr_tokenizer.word_index) + 1

# Build model with dropout and lower learning rate
model = Sequential([
    Embedding(vocab_eng, 8, input_length=eng_seq.shape[1]),
    LSTM(16, return_sequences=False),
    Dropout(0.3),
    Dense(vocab_fr, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Prepare target data (next word prediction simplified)
y_train = fr_seq[:, 0]  # Simplified target for demo

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train model
history = model.fit(eng_seq, y_train, epochs=50, batch_size=2, validation_split=0.2, callbacks=[early_stop])

Added a Dropout layer with rate 0.3 after the LSTM layer to reduce overfitting.

Lowered the learning rate to 0.0001 for smoother training.

Added EarlyStopping callback to stop training when validation loss stops improving.

Reduced batch size to 2 to improve generalization.

Results Interpretation

Before: Training accuracy was 95%, validation accuracy was 60%, showing strong overfitting.

After: Training accuracy dropped to 88%, validation accuracy improved to 82%, and validation loss decreased, indicating better generalization.

Adding dropout, lowering learning rate, and using early stopping help reduce overfitting. This improves the model's ability to translate new sentences, breaking language barriers more effectively.

Bonus Experiment

Try using a sequence-to-sequence model with attention mechanism to improve translation quality further.

💡 Hint

Use TensorFlow's Functional API to build encoder-decoder architecture with attention layers.