
Feature union in ML Python - ML Experiment: Train & Evaluate

Experiment - Feature union
Problem: You want to combine different types of features from the same dataset to improve a classification model. Currently, you use only one type of feature, and the model accuracy is moderate.
Current Metrics: Training accuracy: 78%, Validation accuracy: 75%
Issue: The model uses only one feature set, missing useful information from other features. This limits accuracy.
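For orientation, a single-feature baseline of the kind described above might look like the sketch below. The exercise does not say which extractor the baseline uses, so CountVectorizer is an assumption here, and the four-document corpus and its labels are made up purely so the snippet runs without downloading data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy corpus (assumption): 1 = space-related, 0 = car-related.
toy_docs = [
    "the rocket reached orbit",
    "the car needs new brakes",
    "orbit insertion burn complete",
    "engine oil and brakes checked",
]
toy_labels = [1, 0, 1, 0]

# Baseline: one feature extractor feeding the classifier directly.
baseline = Pipeline([
    ("count", CountVectorizer(max_features=1000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(toy_docs, toy_labels)

baseline_preds = baseline.predict(toy_docs)
```

The task then amounts to replacing the single `count` step with a FeatureUnion while leaving the `clf` step unchanged.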
Your Task
Use FeatureUnion to combine two different feature extraction methods and improve validation accuracy to at least 80%.
You must use FeatureUnion from sklearn.pipeline.
Keep the same classifier (LogisticRegression).
Do not change the dataset or target variable.
Solution
ML Python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
newsgroups = fetch_20newsgroups(subset='all', categories=['sci.space', 'rec.autos'])
X = newsgroups.data
y = newsgroups.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define feature extractors
count_vect = ('count', CountVectorizer(max_features=1000))
tfidf_vect = ('tfidf', TfidfVectorizer(max_features=1000))

# Combine features
combined_features = FeatureUnion([count_vect, tfidf_vect])

# Create pipeline
pipeline = Pipeline([
    ('features', combined_features),
    ('clf', LogisticRegression(max_iter=1000))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
test_preds = pipeline.predict(X_test)
train_acc = accuracy_score(y_train, train_preds) * 100
test_acc = accuracy_score(y_test, test_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {test_acc:.2f}%')
Added two feature extractors: CountVectorizer and TfidfVectorizer.
Combined them using FeatureUnion to merge features.
Built a pipeline with combined features and LogisticRegression.
Trained and evaluated the model on the same data split.
Results Interpretation

Before: Training accuracy: 78%, Validation accuracy: 75%

After: Training accuracy: 85.5%, Validation accuracy: 81.2%

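FeatureUnion concatenates the output columns of each feature extractor, so the classifier sees raw term counts and TF-IDF weights side by side instead of only one representation. This gives the model more information to work with, but accuracy improves only to the extent that the combined feature sets capture different aspects of the data.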
Bonus Experiment
Try adding a third feature extractor, such as a custom transformer that extracts the text length or the number of special characters, then include it in the FeatureUnion alongside the two vectorizers.
💡 Hint
Create a simple transformer class with fit and transform methods that outputs a numeric feature, then add it to the FeatureUnion list.
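One possible shape for such a transformer is sketched below. The class name `TextStats`, the choice of "special character" (anything that is neither alphanumeric nor whitespace), and the two sample strings are all assumptions made for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextStats(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: character length and special-character count."""

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the training data.
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            # "Special" here means neither alphanumeric nor whitespace (assumption).
            n_special = sum(not ch.isalnum() and not ch.isspace() for ch in doc)
            rows.append([len(doc), n_special])
        # One row per document, two numeric columns.
        return np.array(rows, dtype=float).reshape(-1, 2)


# Made-up sample documents.
docs = ["Launch at 09:00!", "Brakes squeal?"]

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("stats", TextStats()),
])
X_combined = features.fit_transform(docs)

stats_only = TextStats().transform(docs)
```

Raw lengths sit on a much larger scale than TF-IDF weights, so in a real pipeline it may help to scale the `stats` branch (for example with StandardScaler inside its own small Pipeline) before LogisticRegression sees it.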