Feature engineering pipelines in MLOps - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main purpose of a feature engineering pipeline in MLOps?

Choose the best description of why we use feature engineering pipelines in machine learning operations.

A. To collect raw data from various sources.
B. To deploy machine learning models to production environments.
C. To automate and standardize the process of transforming raw data into features for models.
D. To monitor the performance of models after deployment.
💡 Hint

Think about what happens before training a model with raw data.

💻 Command Output
intermediate
Output of a feature pipeline step using scikit-learn's ColumnTransformer

Given the following Python code snippet, what is the output shape of X_transformed?

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np

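# dtype=object keeps the numeric column numeric instead of casting every cell to a string
X = np.array([[25, 'red'], [30, 'blue'], [22, 'green']], dtype=object)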

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), [0]),
        ('cat', OneHotEncoder(), [1])
    ])

X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)
A. (3, 3)
B. (3, 4)
C. (2, 4)
D. (3, 2)
💡 Hint

Count numeric and categorical features after transformation.
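
The counting rule in the hint can be checked on a separate toy dataset. This is a minimal sketch: the data and the names `X_demo` and `demo_preprocessor` are made up for illustration. A scaled numeric column stays one column, and a one-hot encoded column becomes one column per category seen during fit.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column 0 is numeric, column 1 is categorical.
# dtype=object keeps the numbers from being cast to strings.
X_demo = np.array(
    [[1.0, "cat"], [2.0, "dog"], [3.0, "cat"], [4.0, "dog"]],
    dtype=object,
)

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [0]),  # one column in, one column out
        ("cat", OneHotEncoder(), [1]),   # one column in, one column per category out
    ]
)

X_demo_out = demo_preprocessor.fit_transform(X_demo)

# Categories the encoder learned from column 1.
learned_categories = demo_preprocessor.named_transformers_["cat"].categories_[0]

n_output_columns = X_demo_out.shape[1]
print(len(learned_categories), n_output_columns)
```

The output width is the number of scaled numeric columns plus the number of learned categories.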

🔀 Workflow
advanced
Order the steps to build a feature engineering pipeline for a new dataset

Arrange the following steps in the correct order to create a feature engineering pipeline.

A. 3,1,2,4
B. 2,1,3,4
C. 1,3,2,4
D. 1,2,3,4
💡 Hint

Think about understanding data first, then defining transformations, then implementation, then testing.
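
The four phases named in the hint (understand, define, implement, test) could look roughly like this in code. It is only a sketch under assumptions: the data is invented, and median imputation plus scaling stand in for whatever transformations a real dataset would need.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with one missing value.
X_raw = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

# Understand the data: count missing values per column.
missing_per_column = np.isnan(X_raw).sum(axis=0)

# Define the transformations that the inspection calls for.
steps = [
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
]

# Implement them as one pipeline object.
feature_pipeline = Pipeline(steps)

# Test the output against simple expectations.
X_features = feature_pipeline.fit_transform(X_raw)
assert X_features.shape == X_raw.shape
assert not np.isnan(X_features).any()
assert np.allclose(X_features.mean(axis=0), 0.0)
```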

Troubleshoot
advanced
Why does this feature pipeline raise a ValueError during fit?

Consider this code snippet that raises an error during fit. What is the most likely cause?

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1, 2], [np.nan, 3], [7, 6]])

pipeline = Pipeline([
    ('pca', PCA(n_components=1)),
    ('imputer', SimpleImputer(strategy='mean'))
])

pipeline.fit(X)
A. Pipeline steps are in the wrong order; the imputer should come before PCA, which rejects NaN values.
B. SimpleImputer requires categorical data, but numeric data was given.
C. PCA cannot use n_components=1 on a two-column array.
D. The input array X has inconsistent row lengths.
💡 Hint

Think about which step should handle missing values first.
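
To see why step order matters for missing values, here is a small sketch using a step that does not accept NaN. PCA is chosen here only as an illustration. Fitting it directly on incomplete data raises a ValueError, while putting the imputer first lets the same data through.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X_nan = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# PCA on its own rejects input that contains NaN.
try:
    PCA(n_components=1).fit(X_nan)
    pca_alone_failed = False
except ValueError:
    pca_alone_failed = True

# With imputation first, PCA only ever sees complete data.
impute_then_pca = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("pca", PCA(n_components=1)),
])
X_reduced = impute_then_pca.fit_transform(X_nan)

print(pca_alone_failed, X_reduced.shape)
```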

Best Practice
expert
Which practice ensures feature engineering pipelines support model reproducibility?

Choose the best practice that helps maintain reproducibility of machine learning models when using feature engineering pipelines.

A. Version control the pipeline code and store pipeline artifacts with the model.
B. Run the pipeline only on training data and ignore test data transformations.
C. Manually apply transformations outside the pipeline for flexibility.
D. Use random transformations without fixing seeds to increase data variety.
💡 Hint

Think about how to keep track of changes and ensure the same transformations are applied later.
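
One way to apply the same transformations later is to save the fitted pipeline as an artifact and reload it at inference time. The sketch below assumes joblib for serialization. The file name `feature_pipeline.joblib` is hypothetical, and a temporary directory stands in for a real artifact store.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.5, 15.0]])

# Fit once on training data; the learned means and scales live in the object.
fitted_pipeline = Pipeline([("scaler", StandardScaler())]).fit(X_train)

with tempfile.TemporaryDirectory() as artifact_dir:
    # Hypothetical artifact name; in practice it would be stored next to the model.
    artifact_path = os.path.join(artifact_dir, "feature_pipeline.joblib")
    joblib.dump(fitted_pipeline, artifact_path)
    restored_pipeline = joblib.load(artifact_path)

# The restored pipeline reproduces the original features for new data.
same_features = np.allclose(
    fitted_pipeline.transform(X_new), restored_pipeline.transform(X_new)
)
print(same_features)
```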

Practice

1. What is the main purpose of a feature engineering pipeline in MLOps?
easy
A. To automate and standardize data preparation steps
B. To deploy machine learning models to production
C. To monitor model performance after deployment
D. To collect raw data from external sources

Solution

  1. Step 1: Understand the role of feature engineering pipelines

    Feature engineering pipelines automate the process of transforming raw data into features for model training and testing.
  2. Step 2: Differentiate from other MLOps tasks

    Deploying models, monitoring, and data collection are separate tasks from feature engineering pipelines.
  3. Final Answer:

    To automate and standardize data preparation steps -> Option A
  4. Quick Check:

    Feature engineering pipeline = automate data prep [OK]
Hint: Feature pipelines automate data prep, not deployment or monitoring [OK]
Common Mistakes:
  • Confusing feature pipelines with model deployment
  • Thinking pipelines collect raw data
  • Mixing up monitoring with feature engineering
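
As a small illustration of automating and standardizing data preparation, the sketch below fits a one-step pipeline on training data and reuses the learned parameters on new data. The arrays and the name `prep` are invented for the example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[20.0]])

prep = Pipeline([("scaler", StandardScaler())])
prep.fit(X_train)  # learns the training mean (20.0) and standard deviation

# New data is transformed with the parameters learned from training data,
# so a value equal to the training mean maps to zero.
X_test_features = prep.transform(X_test)
print(X_test_features)
```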
2. Which of the following is the correct way to define a simple feature engineering pipeline step using scikit-learn's Pipeline?
easy
A. pipeline = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=2))])
B. pipeline = Pipeline('scaler', StandardScaler(), 'pca', PCA(n_components=2))
C. pipeline = Pipeline({'scaler': StandardScaler(), 'pca': PCA(n_components=2)})
D. pipeline = Pipeline(StandardScaler(), PCA(n_components=2))

Solution

  1. Step 1: Recall scikit-learn Pipeline syntax

    Pipeline expects a list of tuples, each tuple with a name and a transformer object.
  2. Step 2: Check each option's syntax

    Option A correctly uses a list of (name, transformer) tuples. Option B passes names and transformers as separate positional arguments, Option C passes a dict, and Option D omits the step names.
  3. Final Answer:

    pipeline = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=2))]) -> Option A
  4. Quick Check:

    Pipeline needs list of (name, transformer) tuples [OK]
Hint: Pipeline needs list of (name, transformer) tuples [OK]
Common Mistakes:
  • Passing arguments without list brackets
  • Using dict instead of list of tuples
  • Omitting step names in pipeline
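
The list-of-tuples form can be inspected through the pipeline's `steps` attribute. The sketch below also shows scikit-learn's `make_pipeline` helper, which is not part of the question but generates step names automatically from the class names.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit form: a list of (name, transformer) tuples.
explicit = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])

# Helper form: names are derived from the lowercased class names.
auto_named = make_pipeline(StandardScaler(), PCA(n_components=2))

explicit_names = [name for name, _ in explicit.steps]
auto_names = [name for name, _ in auto_named.steps]
print(explicit_names)
print(auto_names)
```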
3. Given the following pipeline code, what will be the output of pipeline.transform([[0, 0], [1, 1]])?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([
  ('scaler', StandardScaler()),
  ('pca', PCA(n_components=1))
])
pipeline.fit([[0, 0], [1, 1]])
result = pipeline.transform([[0, 0], [1, 1]])
print(result)
medium
A. Error: PCA requires more than one sample
B. [[0. 0.] [1. 1.]]
C. [[0.5] [0.5]]
D. [[-1.41421356] [ 1.41421356]]

Solution

  1. Step 1: Understand pipeline steps

    First, data is scaled to zero mean and unit variance, then PCA reduces to 1 component.
  2. Step 2: Calculate transformed output

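    Scaling [[0, 0], [1, 1]] gives [[-1, -1], [1, 1]], because each column has mean 0.5 and standard deviation 0.5. Both scaled points lie along the direction (1, 1)/sqrt(2), so projecting onto that single principal component gives approximately [[-1.41421356], [1.41421356]]. The sign of a principal component is arbitrary, so some scikit-learn versions may print the two values with their signs swapped.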
  3. Final Answer:

    [[-1.41421356] [ 1.41421356]] -> Option D
  4. Quick Check:

    Scaling + PCA output = [[-1.41421356] [ 1.41421356]] [OK]
Hint: Scaling centers data; PCA output is principal component values [OK]
Common Mistakes:
  • Expecting original data as output
  • Confusing PCA output shape
  • Assuming error due to small data
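
The arithmetic behind the answer can be reproduced by hand with NumPy and compared against the pipeline. This is a sketch: it compares magnitudes only, on the assumption that the sign of the component may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_small = np.array([[0.0, 0.0], [1.0, 1.0]])

# Manual scaling: subtract the column mean, divide by the population std.
X_scaled = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

# Both scaled points lie on the direction (1, 1)/sqrt(2); project onto it.
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
manual_projection = X_scaled @ direction

pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=1))])
pipeline_projection = pipe.fit_transform(X_small).ravel()

# Compare absolute values, since the component's sign is arbitrary.
magnitudes_match = np.allclose(
    np.abs(pipeline_projection), np.abs(manual_projection)
)
print(magnitudes_match)
```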
4. You have this pipeline code but it raises an error: ValueError: Expected 2D array, got 1D array instead. What is the likely cause?
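from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([
  ('scaler', StandardScaler()),
  ('pca', PCA(n_components=1))
])
pipeline.fit([1, 2, 3, 4])  # flat list: one dimension only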
medium
A. Pipeline steps must be functions, not classes
B. Input to fit should be 2D array, not 1D list
C. StandardScaler requires integer inputs only
D. PCA cannot have n_components=1

Solution

  1. Step 1: Analyze error message

    The error says input is 1D but 2D is expected for fit method.
  2. Step 2: Check input format

    Input [1, 2, 3, 4] is a 1D list; fit expects 2D array like [[1], [2], [3], [4]].
  3. Final Answer:

    Input to fit should be 2D array, not 1D list -> Option B
  4. Quick Check:

    fit input shape must be 2D [OK]
Hint: fit() needs 2D array shape, not flat list [OK]
Common Mistakes:
  • Passing 1D list instead of 2D array
  • Misunderstanding PCA parameter limits
  • Thinking StandardScaler restricts input types
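
A common fix for this error is to reshape the flat values into a single-column 2D array before fitting. The sketch below assumes the four numbers are four samples of one feature. If they were one sample with four features, `reshape(1, -1)` would be the matching call.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

values = np.array([1, 2, 3, 4])

# Four samples, one feature: shape (4,) becomes shape (4, 1).
X_column = values.reshape(-1, 1)

pipe_1d = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=1))])
out = pipe_1d.fit_transform(X_column)

print(X_column.shape, out.shape)
```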
5. You want to create a feature engineering pipeline that fills missing values with the median, then scales the features, and finally keeps at most the 3 most important features using a model-based selector. Which pipeline setup is correct?
hard
A. Pipeline([('scaler', StandardScaler()), ('imputer', SimpleImputer(strategy='median')), ('selector', SelectFromModel(estimator=RandomForestClassifier(), max_features=3))])
B. Pipeline([('selector', SelectFromModel(estimator=RandomForestClassifier(), max_features=3)), ('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
C. Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler()), ('selector', SelectFromModel(estimator=RandomForestClassifier(), max_features=3))])
D. Pipeline([('imputer', SimpleImputer(strategy='mean')), ('selector', SelectFromModel(estimator=RandomForestClassifier(), max_features=3)), ('scaler', StandardScaler())])

Solution

  1. Step 1: Order pipeline steps logically

    Missing values must be handled first, then scaling, then feature selection.
  2. Step 2: Check each option's correctness

    Option C follows the correct order and uses the median for imputation. Option A places the scaler before the imputer. Option B starts with the selector, before missing values have been imputed. Option D uses the mean instead of the median and selects features before scaling.
  3. Final Answer:

    Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler()), ('selector', SelectFromModel(estimator=RandomForestClassifier(), max_features=3))]) -> Option C
  4. Quick Check:

    Impute -> scale -> select features [OK]
Hint: Impute missing -> scale -> select features in pipeline order [OK]
Common Mistakes:
  • Placing scaler before imputer
  • Selecting features before imputing missing values
  • Using mean instead of median when median is required
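
The impute, scale, select order can be run end to end on synthetic data. In this sketch the dataset, the injected missing values and the random seeds are all made up. It also sets `threshold=-np.inf`, which is not in the question: with the default threshold, `max_features=3` is only an upper limit, so the extra argument is what makes the selector keep exactly three features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X[::10, 0] = np.nan  # inject missing values into the first column

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("selector", SelectFromModel(
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        max_features=3,
        threshold=-np.inf,  # assumption: rank by importance only, keep exactly 3
    )),
])

X_selected = pipe.fit_transform(X, y)
print(X_selected.shape)
```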