SciPy with scikit-learn pipeline - Mini Project: Build & Apply

Build a SciPy and scikit-learn Pipeline for Data Transformation and Modeling

📖 Scenario: You are working as a data analyst for a small company. You have some data about customers' ages and incomes, and you want to predict their spending score. To do this, you will prepare the data using SciPy and then build a simple model using scikit-learn's pipeline feature.

🎯 Goal: Create a dictionary of customer data, set a configuration variable for an income threshold, build a scikit-learn pipeline that applies a SciPy function to transform the data and then fits a simple model, and finally print the transformed data and the model's predictions.

📋 What You'll Learn

Create a dictionary called customer_data with keys 'age' and 'income' and the exact lists of values provided.

Create a variable called income_threshold and set it to the exact value 50000.

Build a scikit-learn pipeline named pipeline that applies a SciPy logarithm function to income and then fits a simple linear regression model.

Print the transformed income data and the model predictions exactly as specified.

💡 Why This Matters

🌍 Real World

Data scientists often need to preprocess data using mathematical functions from libraries like SciPy before feeding it into machine learning models. Pipelines help organize these steps cleanly.

💼 Career

Understanding how to combine data transformations and models in a pipeline is a key skill for data analysts and data scientists working on predictive modeling tasks.

DATA SETUP: Create the customer data dictionary

Create a dictionary called customer_data with two keys: 'age' and 'income'. Set 'age' to the list [25, 32, 47, 51, 62] and 'income' to the list [40000, 52000, 61000, 58000, 72000].

SciPy

# Create the customer_data dictionary with age and income lists
# Your code here

Need a hint?

Use curly braces to create a dictionary. The keys are 'age' and 'income'. Assign the exact lists to each key.

CONFIGURATION: Set the income threshold

Create a variable called income_threshold and set it to the integer 50000.

SciPy

customer_data = {
    'age': [25, 32, 47, 51, 62],
    'income': [40000, 52000, 61000, 58000, 72000]
}
# Create income_threshold variable and set it to 50000
# Your code here

Need a hint?

Just assign the number 50000 to the variable named income_threshold.
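
The later steps of this project never read income_threshold, so the following is only a hypothetical illustration of how such a configuration value might be used: flagging which incomes lie above the threshold. The name above_threshold is an assumption made for this sketch and is not part of the exercise.

```python
customer_data = {
    'age': [25, 32, 47, 51, 62],
    'income': [40000, 52000, 61000, 58000, 72000]
}
income_threshold = 50000

# Hypothetical use of the threshold: mark each income that exceeds it
above_threshold = [income > income_threshold for income in customer_data['income']]

print(above_threshold)  # [False, True, True, True, True]
```

Keeping the cutoff in a named variable means it can be changed in one place without touching the comparison logic.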

CORE LOGIC: Build the SciPy and scikit-learn pipeline

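Import FunctionTransformer from sklearn.preprocessing, LinearRegression from sklearn.linear_model, Pipeline from sklearn.pipeline, and log1p from scipy.special. (scipy.special does not export a plain log function; log1p computes log(1 + x), which is nearly identical to log(x) for incomes of this size.) Then create a pipeline called pipeline that first applies the logarithm transformation to the income data using FunctionTransformer with log1p, and then fits a LinearRegression model.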

SciPy

customer_data = {
    'age': [25, 32, 47, 51, 62],
    'income': [40000, 52000, 61000, 58000, 72000]
}
income_threshold = 50000

# Import FunctionTransformer, LinearRegression, Pipeline, and a SciPy log function (log1p)
# Create pipeline with log transform and LinearRegression
# Your code here

Need a hint?

Use Pipeline with two steps: a FunctionTransformer that applies log, and a LinearRegression model.
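
To see what the transform step does on its own before it is placed in a pipeline, here is a minimal sketch. It assumes scipy.special.log1p as the SciPy logarithm function (scipy.special has no plain log), so each value is transformed as log(1 + x). The names income, log_step, and transformed are illustrative only.

```python
import numpy as np
from scipy.special import log1p
from sklearn.preprocessing import FunctionTransformer

# Income values from the exercise, reshaped to a single column: (n_samples, 1)
income = np.array([40000, 52000, 61000, 58000, 72000], dtype=float).reshape(-1, 1)

# FunctionTransformer wraps a plain function so it can act as a pipeline step;
# validate=True makes it check that the input is a 2D numeric array
log_step = FunctionTransformer(log1p, validate=True)
transformed = log_step.fit_transform(income)

print(transformed.shape)  # (5, 1)
print(transformed)
```

Because the wrapped function is stateless, fit learns nothing here; the step simply applies log1p element-wise each time transform is called.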

OUTPUT: Transform income and predict spending score

Fit the pipeline on the income data reshaped as a 2D array with one column, using age as the target, since customer_data contains no spending score column. Then print the income data after the log transform step and the predictions from the linear regression model. Use print(transformed_income) and print(predictions) exactly.

SciPy

import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# scipy.special has no plain log; log1p computes log(1 + x)
from scipy.special import log1p

customer_data = {
    'age': [25, 32, 47, 51, 62],
    'income': [40000, 52000, 61000, 58000, 72000]
}
income_threshold = 50000

# Step 1 applies the log transform; step 2 fits a linear regression on the result
pipeline = Pipeline([
    ('log_transform', FunctionTransformer(log1p, validate=True)),
    ('linear_model', LinearRegression())
])

# Fit the pipeline with income data reshaped
# Print the transformed income data and predictions
# Your code here

Need a hint?

Use np.array (from numpy, imported as np) and reshape(-1, 1) to turn the income list into a 2D array with one column. Fit the pipeline with income as the feature and age as the target. Use pipeline.named_steps['log_transform'].transform() to get the transformed income and pipeline.predict() to get the predictions. Print both results.
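
One possible way to complete this step is sketched below. It assumes log1p from scipy.special as the logarithm function (scipy.special has no plain log) and, following the hint, uses age as the regression target. The variable names X and y are illustrative; transformed_income and predictions are the names the step asks for.

```python
import numpy as np
from scipy.special import log1p
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

customer_data = {
    'age': [25, 32, 47, 51, 62],
    'income': [40000, 52000, 61000, 58000, 72000]
}

pipeline = Pipeline([
    ('log_transform', FunctionTransformer(log1p, validate=True)),
    ('linear_model', LinearRegression())
])

# scikit-learn expects features as a 2D array: one row per sample, one column per feature
X = np.array(customer_data['income'], dtype=float).reshape(-1, 1)
# Target values stay 1D (age is used as the target, as the hint suggests)
y = np.array(customer_data['age'], dtype=float)

# Fitting runs the log transform first, then fits the regression on the transformed income
pipeline.fit(X, y)

# Apply only the first step to inspect the transformed feature
transformed_income = pipeline.named_steps['log_transform'].transform(X)
# predict() runs the full pipeline: log transform, then the linear model
predictions = pipeline.predict(X)

print(transformed_income)
print(predictions)
```

With only five samples and predictions made on the training data itself, the output demonstrates the mechanics of the pipeline and says nothing about how well the model would generalize.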