How to Use LSTM for Text Classification in NLP
To use LSTM for text classification in NLP, first convert text into sequences of numbers, then feed these sequences into an LSTM layer to capture word order and context. Finally, add a dense layer with an activation such as softmax or sigmoid to classify the text into categories.
Syntax
An LSTM model for text classification typically includes these parts:
- Embedding layer: Converts words into vectors.
- LSTM layer: Processes sequences to learn context.
- Dense layer: Outputs class probabilities.
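To make the role of each part concrete, the following minimal sketch traces tensor shapes through the three layers. The sizes and the word indices in `batch` are illustrative assumptions, not values from a real dataset.

```python
import numpy as np
import tensorflow as tf

# Illustrative sizes (assumptions chosen only for this sketch).
vocab_size, embedding_dim, num_classes = 50, 8, 3

embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
lstm = tf.keras.layers.LSTM(16)
dense = tf.keras.layers.Dense(num_classes, activation='softmax')

# A batch of 2 padded sequences, each holding 5 word indices.
batch = np.array([[4, 7, 1, 0, 0],
                  [9, 2, 0, 0, 0]])

vectors = embedding(batch)  # one vector per word
summary = lstm(vectors)     # one vector per sequence (the final hidden state)
probs = dense(summary)      # one probability per class

print(vectors.shape)  # (2, 5, 8)
print(summary.shape)  # (2, 16)
print(probs.shape)    # (2, 3)
```

The LSTM collapses the time dimension, so the dense layer sees one fixed-size vector per input text regardless of how the words were arranged in it.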
Example syntax uses tf.keras.Sequential to stack layers.
python
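import tensorflow as tf

# vocab_size, embedding_dim, and num_classes are placeholders you define for your data.
model = tf.keras.Sequential([
    # Recent Keras versions deprecate Embedding's input_length argument, so it is omitted here.
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    tf.keras.layers.LSTM(units=64),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Example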
This example shows how to prepare text data, build an LSTM model, train it, and evaluate accuracy on a simple dataset.
python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data
texts = ['I love machine learning', 'Deep learning is fun', 'I hate bugs', 'Debugging is boring']
labels = np.array([1, 1, 0, 0])  # 1=positive, 0=negative (NumPy array, not a plain list, for model.fit)

# Tokenize and pad sequences
vocab_size = 50
max_length = 5
embedding_dim = 8
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(padded, labels, epochs=10, verbose=0)

# Evaluate on the training data (this tiny dataset has no separate test set)
loss, accuracy = model.evaluate(padded, labels, verbose=0)
print(f'Accuracy: {accuracy:.2f}')
Output
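Accuracy: 1.00
The exact value can vary between runs because the model weights are initialized randomly and the dataset has only four samples.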
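Once a model like the one above is trained, new text has to go through the same tokenizer and padding before prediction. The sketch below shows one way to wrap that. `classify_texts` and its `threshold` parameter are hypothetical names introduced here, and the tokenizer and untrained model are stand-ins so the snippet runs on its own; in practice you would pass the fitted tokenizer and trained model from the example.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def classify_texts(model, tokenizer, texts, max_length, threshold=0.5):
    """Hypothetical helper: return (probabilities, 0/1 labels) for raw strings."""
    sequences = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(sequences, maxlen=max_length, padding='post')
    probs = model.predict(padded, verbose=0).ravel()
    return probs, (probs >= threshold).astype(int)

# Stand-ins so the sketch is self-contained (the model here is untrained,
# so its predictions are meaningless).
tokenizer = Tokenizer(num_words=50, oov_token='<OOV>')
tokenizer.fit_on_texts(['I love machine learning', 'I hate bugs'])
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(50, 8),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

probs, preds = classify_texts(model, tokenizer,
                              ['I love bugs', 'unknown words here'],
                              max_length=5)
print(probs.shape, preds)
```

Words the tokenizer never saw are mapped to the `<OOV>` index, so unseen vocabulary does not raise an error but also carries no learned meaning.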
Common Pitfalls
Common mistakes when using LSTM for text classification include:
- Not padding sequences to the same length, causing errors.
- Using the wrong loss function for the label format (e.g., using categorical_crossentropy with integer labels, which call for sparse_categorical_crossentropy).
- Ignoring the need to tokenize and convert text to numbers before feeding it into the model.
- Choosing too many or too few LSTM units without tuning.
Always preprocess text properly and match loss function to label format.
python
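import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Wrong: no padding
texts = ['hello world', 'hi']
# Tokenizer step omitted for brevity; assume it produced these sequences
sequences = [[1, 2], [3]]
# Feeding sequences of different lengths directly causes errors

# Right: pad sequences to a common length
padded = pad_sequences(sequences, maxlen=2, padding='post')  # [[1, 2], [3, 0]]

# Use the correct loss for binary labels (minimal model so compile() has a target)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10, output_dim=4),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam')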
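The loss and label-format pairing can also be checked directly, without training anything. The sketch below uses made-up probabilities for two samples over three classes and shows that the sparse loss on integer labels gives the same per-sample values as the non-sparse loss on one-hot labels.

```python
import numpy as np
import tensorflow as tf

# Made-up predicted probabilities: 2 samples, 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]], dtype='float32')

# The same targets in two formats.
int_labels = np.array([0, 2])  # integer class ids
onehot_labels = tf.keras.utils.to_categorical(int_labels, num_classes=3)  # one-hot rows

# Integer labels pair with the sparse loss; one-hot labels pair with the non-sparse loss.
sparse_loss = tf.keras.losses.sparse_categorical_crossentropy(int_labels, probs)
dense_loss = tf.keras.losses.categorical_crossentropy(onehot_labels, probs)

print(np.asarray(sparse_loss))
print(np.asarray(dense_loss))
```

Both calls compute the negative log of the probability assigned to the true class, so the two printed arrays match; only the expected label format differs.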
Quick Reference
- Embedding layer: Converts words to vectors.
- LSTM layer: Captures sequence context.
- Dense layer: Outputs class probabilities.
- Loss function: Use binary_crossentropy for two classes (with a single sigmoid output), sparse_categorical_crossentropy for multiple classes with integer labels.
- Preprocessing: Tokenize text and pad sequences to the same length.
Key Takeaways
Always convert text to padded sequences before feeding into LSTM.
Use an embedding layer to turn words into vectors for the LSTM.
Choose the right loss function based on your label format.
LSTM layers help capture word order and context in text.
Tune LSTM units and training epochs for best accuracy.
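As a rough aid for tuning, it can help to see how quickly the number of trainable weights grows with the unit count. This sketch counts the parameters of a single LSTM layer for a few sizes, assuming an embedding dimension of 8; the sizes tried are arbitrary illustrations, not recommendations.

```python
import tensorflow as tf

embedding_dim = 8  # assumed size of each word vector fed to the LSTM

param_counts = {}
for units in (16, 64, 256):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, embedding_dim)),  # variable-length sequences
        tf.keras.layers.LSTM(units),
    ])
    # A standard LSTM has 4 * units * (units + embedding_dim + 1) weights.
    param_counts[units] = model.count_params()

print(param_counts)  # {16: 1600, 64: 18688, 256: 271360}
```

Because the count grows roughly with the square of the unit count, a large layer on a small dataset is more likely to overfit, which is one reason to tune this value rather than guess.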
