Text Clustering in Python for NLP: Simple Guide with Example
To do text clustering in Python for NLP, first convert text data into numerical vectors using methods like TfidfVectorizer. Then apply clustering algorithms such as KMeans to group similar texts automatically.
Syntax
Text clustering in Python typically involves these steps:
- Vectorization: Convert text into numbers using TfidfVectorizer().
- Clustering: Use KMeans(n_clusters=number) to group texts.
- Fit and predict: Call fit() on vectors and predict() to assign clusters.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder input: replace with your own list of strings
texts = ["first document", "second document", "third document"]

# Step 1: Convert texts to vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Step 2: Create KMeans model
kmeans = KMeans(n_clusters=3, random_state=42)

# Step 3: Fit model and get clusters
kmeans.fit(X)
clusters = kmeans.predict(X)
```
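Once the model is fitted, predict() can also assign texts the model has not seen before. The sketch below is one way this might look; the training and new texts are hypothetical, and n_init=10 is set explicitly here as an assumption to keep behavior the same across scikit-learn versions. The key point is that new texts go through the already fitted vectorizer with transform(), not fit_transform(), so they are mapped onto the same vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical training texts, used only for illustration
train_texts = [
    "I love reading books about history.",
    "History books provide great knowledge.",
    "The movie was fantastic and thrilling.",
    "That thriller movie kept me on edge.",
]

# Learn the vocabulary and IDF weights from the training texts
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X_train)

# Reuse the fitted vectorizer: transform(), not fit_transform()
new_texts = ["A new book on ancient history."]
X_new = vectorizer.transform(new_texts)
new_labels = kmeans.predict(X_new)

for text, label in zip(new_texts, new_labels):
    print(f"Cluster {label}: {text}")
```

Words in the new text that were not in the training vocabulary are ignored, so a new text that shares no words with the training data ends up as an all-zero vector and its cluster assignment carries little meaning.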
Example
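This example shows how to cluster a list of short texts into two groups using TfidfVectorizer and KMeans. It prints each text with its assigned cluster. The cluster numbers are arbitrary labels: 0 and 1 only distinguish the groups and carry no meaning of their own.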
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "I love reading books about history.",
    "The movie was fantastic and thrilling.",
    "History books provide great knowledge.",
    "That thriller movie kept me on edge.",
    "Reading novels is a relaxing hobby.",
    "Movies can be very entertaining and fun."
]

# Convert texts to TF-IDF vectors, dropping English stop words
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

# Group the vectors into two clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
clusters = kmeans.labels_

for text, cluster in zip(texts, clusters):
    print(f"Cluster {cluster}: {text}")
```
Output
Cluster 1: I love reading books about history.
Cluster 0: The movie was fantastic and thrilling.
Cluster 1: History books provide great knowledge.
Cluster 0: That thriller movie kept me on edge.
Cluster 1: Reading novels is a relaxing hobby.
Cluster 0: Movies can be very entertaining and fun.
Common Pitfalls
Common mistakes when doing text clustering include:
- Not removing stop words, which adds noise to vectors.
- Choosing too many or too few clusters without testing.
- Using raw text instead of vectorized data for clustering.
- Ignoring random state, which makes results non-reproducible.
Always preprocess text and experiment with cluster numbers.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["Cats are cute.", "Dogs are loyal.", "I love my cat."]

# Wrong: Using raw texts directly
# kmeans = KMeans(n_clusters=2)
# kmeans.fit(texts)  # This will raise an error

# Right: Vectorize texts first
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)
```
Output
[1 0 1]
Quick Reference
Tips for effective text clustering:
- Use TfidfVectorizer with stop_words='english' to filter out common English words that carry little meaning.
- Pick n_clusters based on domain knowledge or use methods like the elbow method.
- Set random_state for reproducible results.
- Evaluate clusters qualitatively by reading sample texts per cluster.
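The elbow method mentioned above can be sketched by fitting KMeans for several values of n_clusters and comparing inertia_ (the within-cluster sum of squared distances). The silhouette score is an addition not covered in this guide, shown here as a second, optional signal. The range of k values and n_init=10 are assumptions chosen for this tiny dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [
    "I love reading books about history.",
    "The movie was fantastic and thrilling.",
    "History books provide great knowledge.",
    "That thriller movie kept me on edge.",
    "Reading novels is a relaxing hobby.",
    "Movies can be very entertaining and fun.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Map each candidate k to (inertia, silhouette score)
results = {}
for k in range(2, 5):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    score = silhouette_score(X, kmeans.labels_)
    results[k] = (kmeans.inertia_, score)
    print(f"k={k}: inertia={kmeans.inertia_:.3f}, silhouette={score:.3f}")
```

Inertia generally decreases as k grows, so look for the point where the drop flattens out (the "elbow") rather than for the minimum. A higher silhouette score suggests better separated clusters. On a dataset this small the numbers are only illustrative.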
Key Takeaways
Convert text to numerical vectors using TfidfVectorizer before clustering.
Use KMeans with a chosen number of clusters to group similar texts.
Remove stop words to improve clustering quality.
Set random_state in KMeans for consistent results.
Test different cluster counts and evaluate clusters manually.
