Prompt Engineering / GenAI (~20 mins)

Hierarchical chunking in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Hierarchical chunking
Problem: You want to build a text classification model that understands long documents by breaking them into smaller parts (chunks) and then combining the information hierarchically.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Validation loss: 0.85
Issue: The model overfits the training data and performs poorly on validation data because it does not effectively capture the hierarchical structure of long texts.
Your Task
Reduce overfitting and improve validation accuracy to above 80% by implementing hierarchical chunking in the model.
You must keep the same dataset and base model architecture (e.g., LSTM or Transformer).
You cannot increase the training data size.
You should not reduce the model capacity drastically.
Solution
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, TimeDistributed, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping

# Parameters
max_chunks = 5  # number of chunks per document
chunk_size = 100  # words per chunk
embedding_dim = 50
lstm_units = 64
num_classes = 3

# Dummy data for illustration; replace with real chunked document embeddings
# X shape: (num_samples, max_chunks, chunk_size, embedding_dim)
num_samples = 1000
X_train = np.random.rand(num_samples, max_chunks, chunk_size, embedding_dim).astype(np.float32)
y_train = tf.keras.utils.to_categorical(np.random.randint(0, num_classes, num_samples), num_classes)
X_val = np.random.rand(200, max_chunks, chunk_size, embedding_dim).astype(np.float32)
y_val = tf.keras.utils.to_categorical(np.random.randint(0, num_classes, 200), num_classes)

# Model definition
# Input shape: (max_chunks, chunk_size, embedding_dim)
input_layer = Input(shape=(max_chunks, chunk_size, embedding_dim))

# Encode each chunk with a shared LSTM
chunk_encoder = TimeDistributed(Bidirectional(LSTM(lstm_units, return_sequences=False)))(input_layer)
chunk_encoder = Dropout(0.3)(chunk_encoder)

# Combine chunk encodings with another LSTM
hierarchical_lstm = Bidirectional(LSTM(lstm_units, return_sequences=False))(chunk_encoder)
hierarchical_lstm = Dropout(0.3)(hierarchical_lstm)

# Output layer
output_layer = Dense(num_classes, activation='softmax')(hierarchical_lstm)

model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Early stopping to prevent overfitting
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train model
history = model.fit(X_train, y_train, epochs=30, batch_size=32, validation_data=(X_val, y_val), callbacks=[early_stop])
Key Steps
Split each input document into a fixed number of chunks of fixed size.
Used a TimeDistributed layer to encode each chunk separately with a shared Bidirectional LSTM.
Added a second Bidirectional LSTM to combine the chunk-level encodings hierarchically.
Added dropout after each LSTM to reduce overfitting.
Used an early stopping callback to halt training when validation loss stops improving.
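The solution above uses random dummy arrays, so the chunk-splitting step is implied rather than shown. A minimal sketch of that preprocessing is below; `chunk_document` is a hypothetical helper (not part of the original snippet), and it assumes words have already been mapped to embedding vectors:

```python
import numpy as np

# Hypothetical helper: turn one embedded document of shape
# (num_words, embedding_dim) into the fixed
# (max_chunks, chunk_size, embedding_dim) layout the model expects.
def chunk_document(doc_vectors, max_chunks=5, chunk_size=100):
    embedding_dim = doc_vectors.shape[1]
    total = max_chunks * chunk_size
    padded = np.zeros((total, embedding_dim), dtype=np.float32)
    n = min(len(doc_vectors), total)  # truncate overly long documents
    padded[:n] = doc_vectors[:n]      # zero-pad short ones
    return padded.reshape(max_chunks, chunk_size, embedding_dim)

doc = np.random.rand(430, 50).astype(np.float32)  # a 430-word embedded document
chunks = chunk_document(doc)
print(chunks.shape)  # (5, 100, 50)
```

Stacking the outputs of `chunk_document` over a corpus yields the 4-D `X_train` / `X_val` tensors used in the training code.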
Results Interpretation

Before: Training accuracy was very high (95%) but validation accuracy was low (70%), showing overfitting.

After: Training accuracy decreased slightly to 88%, but validation accuracy improved to 82%, and validation loss decreased, indicating better generalization.

Hierarchical chunking helps the model understand long documents better by processing smaller parts first and then combining their information. Adding dropout and early stopping reduces overfitting and improves validation performance.
Bonus Experiment
Try replacing the LSTM layers with Transformer encoder layers for chunk encoding and hierarchical combination.
💡 Hint
Use multi-head self-attention layers and positional encoding to capture relationships within and between chunks.
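One way the bonus experiment might look, sketched with Keras' built-in MultiHeadAttention and a fixed sinusoidal positional encoding. The layer sizes, head count, and pooling choices here are illustrative assumptions, not a prescribed solution:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

max_chunks, chunk_size, embedding_dim, num_classes = 5, 100, 50, 3

def positional_encoding(length, depth):
    # Fixed sinusoidal positional encoding, precomputed as a constant array
    pos = np.arange(length)[:, None]
    i = np.arange(depth)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / depth)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle)).astype(np.float32)

def transformer_block(x, num_heads=4, ff_dim=128):
    # Multi-head self-attention + feed-forward, each with residual + layer norm
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=embedding_dim // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(ff_dim, activation='relu')(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization()(x + ff)

# Chunk-level encoder: attention over the words inside one chunk, then mean-pool
chunk_in = layers.Input(shape=(chunk_size, embedding_dim))
h = chunk_in + positional_encoding(chunk_size, embedding_dim)
h = transformer_block(h)
h = layers.GlobalAveragePooling1D()(h)
chunk_encoder = Model(chunk_in, h)

# Document-level encoder: attention over the sequence of chunk vectors
doc_in = layers.Input(shape=(max_chunks, chunk_size, embedding_dim))
x = layers.TimeDistributed(chunk_encoder)(doc_in)
x = x + positional_encoding(max_chunks, embedding_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
out = layers.Dense(num_classes, activation='softmax')(x)

model = Model(doc_in, out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```

This keeps the same two-level structure as the LSTM solution: a shared encoder per chunk (wrapped in TimeDistributed) and a second encoder across chunks, with positional encodings capturing order within and between chunks.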