
How to Use LSTM for Text Classification in NLP

To use an LSTM for text classification in NLP, first convert the text into sequences of integers, then feed these sequences into an LSTM layer to capture word order and context. Finally, add a dense layer with a softmax activation (multiple classes) or a sigmoid activation (two classes) to classify the text into categories.
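
To make the "text into sequences of numbers" step concrete, here is a minimal pure-Python sketch of what a word-level tokenizer does. It only loosely mirrors the Keras `Tokenizer` (index 0 reserved for padding, index 1 for unknown words, frequent words first); the helper names `build_word_index` and `texts_to_sequences` are hypothetical and written for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer; index 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    # Most frequent words get the smallest indices; ties keep first-seen order.
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace every word with its index, falling back to the OOV index."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in text.lower().split()] for text in texts]

texts = ["I love machine learning", "Deep learning is fun"]
word_index = build_word_index(texts)

# "bugs" was never seen while building the index, so it maps to the OOV index 1.
sequences = texts_to_sequences(texts + ["I love bugs"], word_index)
print(word_index)
print(sequences)  # [[3, 4, 5, 2], [6, 2, 7, 8], [3, 4, 1]]
```

The integer sequences, not the raw strings, are what the embedding layer receives.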

📐

Syntax

An LSTM model for text classification typically includes these parts:

Embedding layer: Converts words into vectors.
LSTM layer: Processes sequences to learn context.
Dense layer: Outputs class probabilities.

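The syntax below uses tf.keras.Sequential to stack these layers; vocab_size, embedding_dim, max_length, and num_classes are placeholders that you define for your own data.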

python

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    tf.keras.layers.LSTM(units=64),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
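
To get a feel for how the three layers scale, the trainable parameter counts can be worked out by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer with biases, and a dense layer; the sizes (a 50-word vocabulary, 8-dimensional embeddings, 16 LSTM units, 1 output) are assumptions picked for illustration, and the function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Three gates plus the candidate state: each has input weights,
    # recurrent weights, and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output.
    return input_dim * outputs + outputs

vocab_size, embedding_dim, units, num_outputs = 50, 8, 16, 1

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "lstm": lstm_params(embedding_dim, units),
    "dense": dense_params(units, num_outputs),
}
total = sum(counts.values())
print(counts)  # {'embedding': 400, 'lstm': 1600, 'dense': 17}
print(total)   # 2017
```

With these sizes the LSTM layer holds most of the parameters, which is why its unit count is usually the first thing to tune.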

💻

Example

This example shows how to prepare text data, build an LSTM model, train it, and evaluate accuracy on a simple dataset.

python

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data
texts = ['I love machine learning', 'Deep learning is fun', 'I hate bugs', 'Debugging is boring']
# Labels as a NumPy array (a plain Python list can be rejected by model.fit)
labels = np.array([1, 1, 0, 0])  # 1=positive, 0=negative

# Tokenize and pad sequences
vocab_size = 50
max_length = 5
embedding_dim = 8

tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(padded, labels, epochs=10, verbose=0)

# Evaluate
loss, accuracy = model.evaluate(padded, labels, verbose=0)
print(f'Accuracy: {accuracy:.2f}')

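Output (a typical run; the exact value varies with random weight initialization, and because the model is scored on its own four training sentences it says nothing about generalization)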

Accuracy: 1.00
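
The accuracy figure comes from thresholding each sigmoid output and comparing it with the true label. The sketch below shows that calculation in plain Python. It assumes a 0.5 threshold, and the probabilities are made-up values for illustration, not outputs of the model above; `to_labels` and `accuracy` are hypothetical helper names.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

probabilities = [0.91, 0.42, 0.22, 0.48]  # made-up sigmoid outputs
expected = [1, 1, 0, 0]

predicted = to_labels(probabilities)
print(predicted)                      # [1, 0, 0, 0]
print(accuracy(predicted, expected))  # 0.75
```

In a real project the same comparison would be run on held-out sentences that the model did not see during training.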

⚠️

Common Pitfalls

Common mistakes when using LSTM for text classification include:

Not padding sequences to the same length, causing errors.
Using incorrect loss functions for the task (e.g., using categorical_crossentropy with integer labels).
Ignoring the need to tokenize and convert text to numbers before feeding into the model.
Picking the number of LSTM units without tuning: too many units can overfit a small dataset, while too few can underfit.

Always preprocess text properly and match loss function to label format.

python

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Wrong: sequences of different lengths cannot be stacked into one batch
sequences = [[1, 2], [3]]  # e.g. tokenized 'hello world' and 'hi'

# Right: pad every sequence to the same length
padded = pad_sequences(sequences, maxlen=2, padding='post')  # [[1, 2], [3, 0]]

# Use the loss that matches binary (0/1) labels and a single sigmoid output
# (layer sizes here are illustrative only)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10, output_dim=4),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam')
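
The loss and label pairing mentioned above can be checked by hand. The sketch below writes out the three cross-entropy formulas in plain Python for a single example, ignoring the numerical clipping that Keras applies internally. The function names echo the Keras loss names but are local re-implementations for illustration, and the probabilities are made up.

```python
import math

def binary_crossentropy(y_true, p):
    """y_true is 0 or 1; p is the sigmoid output for class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def sparse_categorical_crossentropy(class_index, probs):
    """Label is an integer class index; probs is the softmax output."""
    return -math.log(probs[class_index])

def categorical_crossentropy(one_hot, probs):
    """Label is a one-hot vector; probs is the softmax output."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.7, 0.2, 0.1]  # made-up softmax output over three classes

# The same example, labeled as an integer and as a one-hot vector.
sparse_loss = sparse_categorical_crossentropy(0, probs)
one_hot_loss = categorical_crossentropy([1, 0, 0], probs)
binary_loss = binary_crossentropy(1, 0.7)

print(sparse_loss, one_hot_loss, binary_loss)
```

Both multi-class variants compute the same quantity and differ only in the label format they expect, which is why pairing integer labels with categorical_crossentropy is a mismatch rather than a modeling choice.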

📊

Quick Reference

Embedding layer: Converts words to vectors.
LSTM layer: Captures sequence context.
Dense layer: Outputs class probabilities.
Loss function: Use binary_crossentropy for two classes, sparse_categorical_crossentropy for multiple classes with integer labels.
Preprocessing: Tokenize text and pad sequences to same length.
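
The loss-function rule in this reference can be written down as a small lookup. The helper below, `choose_output_head`, is hypothetical (it is not part of Keras); it only returns the output-layer settings and loss name as plain strings, assuming two-class problems use a single sigmoid unit with 0/1 labels.

```python
def choose_output_head(num_classes, one_hot_labels=False):
    """Suggest output units, activation, and loss for a classifier."""
    if num_classes < 2:
        raise ValueError("classification needs at least two classes")
    if num_classes == 2 and not one_hot_labels:
        # Two classes with 0/1 labels: one sigmoid unit.
        return {"units": 1, "activation": "sigmoid", "loss": "binary_crossentropy"}
    # Otherwise: one softmax unit per class; the loss depends on label format.
    loss = "categorical_crossentropy" if one_hot_labels else "sparse_categorical_crossentropy"
    return {"units": num_classes, "activation": "softmax", "loss": loss}

print(choose_output_head(2))
print(choose_output_head(5))
print(choose_output_head(5, one_hot_labels=True))
```

The returned values map directly onto the `Dense(units, activation=...)` layer and the `loss` argument of `model.compile`.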

✅

Key Takeaways

Always convert text to padded sequences before feeding into LSTM.

Use an embedding layer to turn words into vectors for the LSTM.

Choose the right loss function based on your label format.

LSTM layers help capture word order and context in text.

Tune LSTM units and training epochs for best accuracy.
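
To illustrate the takeaway about word order, the sketch below runs a hand-written LSTM cell with scalar states over a sequence and over its reverse. The weights are arbitrary fixed numbers chosen for illustration (a trained layer learns them), and the cell is a simplified one-unit version of the standard LSTM equations, not the Keras implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One LSTM step with scalar input x, hidden state h, and cell state c."""
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate cell value
    c = f * c + i * g      # keep part of the old memory, add part of the new
    h = o * math.tanh(c)   # expose a gated view of the memory
    return h, c

def run_lstm(sequence, w):
    """Feed the sequence one element at a time and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h

# Arbitrary illustrative weights, not learned values.
weights = {
    "wf": 0.5, "uf": 0.1, "bf": 0.0,
    "wi": 0.6, "ui": 0.2, "bi": 0.0,
    "wo": 0.7, "uo": 0.3, "bo": 0.0,
    "wg": 0.8, "ug": 0.4, "bg": 0.0,
}

sequence = [1.0, 0.0, 0.0]
reversed_sequence = list(reversed(sequence))

h_forward = run_lstm(sequence, weights)
h_reversed = run_lstm(reversed_sequence, weights)

# An order-blind summary such as a sum cannot tell the two apart; the LSTM can.
print(sum(sequence) == sum(reversed_sequence))
print(h_forward, h_reversed)
```

The two final hidden states differ even though both sequences contain the same values, which is the property that lets an LSTM distinguish sentences made of the same words in a different order.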