Reproducible training pipelines
📖 Scenario: You are working as a machine learning engineer. Your team wants to create a training pipeline that always produces the same results when given the same data and code. This helps avoid surprises and makes debugging easier. To do this, you will build a simple reproducible training pipeline step by step.
🎯 Goal: Build a reproducible training pipeline that loads data, sets a fixed random seed, trains a simple model, and prints the model accuracy. This pipeline should produce the same accuracy every time it runs.
📋 What You'll Learn
Create a dataset variable with fixed data
Set a fixed random seed for reproducibility
Train a simple model using the fixed data and seed
Print the model accuracy as the final output
💡 Why This Matters
🌍 Real World
Reproducible training pipelines help teams avoid bugs and inconsistencies when training machine learning models repeatedly.
💼 Career
Understanding reproducibility is essential for ML engineers and data scientists to build reliable and trustworthy models.
1
Create the dataset
Create a variable called data that is a list of tuples with these exact entries: ([0, 0], 0), ([1, 1], 1), ([1, 0], 1), ([0, 1], 0).
Hint
Use a list of tuples where each tuple has a list of features and a label.
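One possible way to write this step, as a minimal sketch: each tuple pairs a two-element feature list with an integer label. The sanity check at the end is an addition for illustration and is not required by the exercise.

```python
# Fixed dataset: (features, label) pairs, exactly as listed in the step.
data = [
    ([0, 0], 0),
    ([1, 1], 1),
    ([1, 0], 1),
    ([0, 1], 0),
]

# Optional sanity check (an illustrative addition, not part of the exercise):
# every entry should have two features and a binary label.
for features, label in data:
    assert len(features) == 2
    assert label in (0, 1)

print(len(data))  # 4
```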
2
Set a fixed random seed
Import the random module and set the random seed to 42 using random.seed(42).
Hint
Use import random at the top and then random.seed(42) to fix randomness.
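A short sketch of what the seed does: resetting the generator to the same seed replays the same sequence of draws. The variable names first_run and second_run are illustrative only.

```python
import random

random.seed(42)  # fix the global generator's starting state
first_run = [random.random() for _ in range(3)]

random.seed(42)  # reset to the same starting state
second_run = [random.random() for _ in range(3)]

# Both lists hold the same three values because the state was reset.
print(first_run == second_run)  # True
```

Note that random.seed only controls Python's standard-library random module. Libraries such as NumPy keep their own generators and need to be seeded separately.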
3
Train a simple model
Create a function called train_model that takes data as input and returns a dictionary with the key 'threshold' set to 0.5. Then call train_model(data) and save the result in a variable called model.
Hint
Define a function that returns a fixed model dictionary and call it.
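A minimal sketch of this step, assuming the dataset from step 1. The function accepts data to match the required signature, but in this exercise it does no real fitting and simply returns a fixed dictionary.

```python
def train_model(data):
    """Return a fixed 'model'. The data argument is accepted but not used for fitting."""
    model = {"threshold": 0.5}
    return model


# Dataset from step 1, repeated so this sketch runs on its own.
data = [([0, 0], 0), ([1, 1], 1), ([1, 0], 1), ([0, 1], 0)]

model = train_model(data)
print(model)  # {'threshold': 0.5}
```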
4
Print the model accuracy
Calculate the accuracy as the fraction of data points for which the model's prediction matches the true label. Use the rule: predict 1 if the sum of the features is >= model['threshold'], otherwise predict 0. Print the accuracy as a float with two decimal places using print(f"Accuracy: {accuracy:.2f}").
Hint
Loop over data, predict using threshold, count correct predictions, then print accuracy.
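Putting the four steps together, the pipeline might look like the sketch below. The helper name predict is a hypothetical addition for readability; the exercise only asks for the prediction rule itself.

```python
import random

random.seed(42)  # step 2; no random draws happen below, kept to mirror the exercise

# Step 1: fixed dataset of (features, label) pairs.
data = [([0, 0], 0), ([1, 1], 1), ([1, 0], 1), ([0, 1], 0)]


# Step 3: "training" returns a fixed threshold model.
def train_model(data):
    return {"threshold": 0.5}


# Hypothetical helper: predict 1 if the feature sum reaches the threshold.
def predict(model, features):
    return 1 if sum(features) >= model["threshold"] else 0


model = train_model(data)

# Step 4: count matching predictions and divide by the dataset size.
correct = sum(1 for features, label in data if predict(model, features) == label)
accuracy = correct / len(data)
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.75
```

Three of the four points are classified correctly. The point ([0, 1], 0) has a feature sum of 1, which is above the threshold, so it is predicted as 1 and counted as an error.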
Practice
(1/5)
1. What is the main goal of a reproducible training pipeline in MLOps?
easy
A. To ensure the training process produces the same results every time
B. To speed up the training by skipping steps
C. To use different data each time for variety
D. To manually adjust parameters during training
Solution
Step 1: Understand reproducibility meaning
Reproducibility means getting the same output when running the same process multiple times.
Step 2: Apply to training pipelines
In training pipelines, reproducibility ensures consistent model results every run.
Final Answer:
To ensure the training process produces the same results every time -> Option A
Quick Check:
Reproducibility = Same results every time [OK]
Hint: Reproducible means repeatable with same results [OK]
Common Mistakes:
Thinking reproducible means faster training
Assuming data changes each run
Believing manual tweaks improve reproducibility
2. Which of the following is the correct way to specify a fixed random seed in a Python training script for reproducibility?
easy
A. seed.random(42)
B. random.set_seed(42)
C. random.seed(42)
D. set.seed(42)
Solution
Step 1: Recall Python random module syntax
Python's random module uses random.seed(value) to fix the seed.
Step 2: Check each option
Only random.seed(42) matches correct Python syntax.
Final Answer:
random.seed(42) -> Option C
Quick Check:
Python random seed = random.seed() [OK]
Hint: Python random seed uses random.seed(value) [OK]
Common Mistakes:
Using incorrect function names like set_seed
Swapping the module and function names, as in seed.random(42)
Confusing with other languages' syntax
3. Given this snippet in a training pipeline script:
import random
random.seed(123)
print(random.randint(1, 10))
random.seed(123)
print(random.randint(1, 10))
What will be the output?
medium
A. Two different random numbers between 1 and 10
B. The same number printed twice
C. An error because seed is set twice
D. Two zeros printed
Solution
Step 1: Understand random.seed effect
Setting random.seed(123) resets the random number generator to a fixed state.
Step 2: Analyze the two prints
Each call to random.randint(1, 10) immediately follows a reset to the same seed, so both calls produce the same number.
Final Answer:
The same number printed twice -> Option B
Quick Check:
Reset seed = repeat random number [OK]
Hint: Resetting seed repeats random numbers [OK]
Common Mistakes:
Assuming different numbers after resetting seed
Expecting error from multiple seed calls
Thinking zeros are default output
4. You have a training pipeline that uses a Docker container but results differ each run. Which fix will help make it reproducible?
medium
A. Add a fixed random seed in the training code
B. Remove Docker and run on host directly
C. Use different data each time to test robustness
D. Increase batch size to speed training
Solution
Step 1: Identify cause of non-reproducibility
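Unseeded randomness in training, such as data shuffling or weight initialization, produces different results on each run. Docker fixes the environment, not the random state.
Step 2: Apply fixed random seed
Adding a fixed seed makes the random number generator produce the same sequence of choices on every run, so the results become reproducible.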
Final Answer:
Add a fixed random seed in the training code -> Option A
Quick Check:
Fixed seed fixes randomness [OK]
Hint: Fix randomness with a seed, not by removing Docker [OK]
Common Mistakes:
Thinking Docker causes randomness
Changing data to fix reproducibility
Adjusting batch size, which is unrelated to reproducibility
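As an illustration of fixing randomness inside training code, the sketch below shuffles sample indices with a private random.Random instance instead of the global generator. The function name shuffled_indices and its parameters are hypothetical, and this is one possible approach rather than a required one.

```python
import random


def shuffled_indices(n, seed):
    """Return indices 0..n-1 in a shuffled order determined only by seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


run_a = shuffled_indices(10, seed=42)
random.random()  # an unrelated global draw does not affect the private generator
run_b = shuffled_indices(10, seed=42)

print(run_a == run_b)  # True
```

Using a dedicated generator per component means other code that draws from the global random module cannot silently change the shuffle order.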
5. In a complex training pipeline, which combination ensures reproducibility across different machines?
1. Fixed random seeds in code
2. Containerized environment with exact dependencies
3. Using latest library versions without version control
4. Logging all hyperparameters and data versions
Choose the best combination.
hard
A. 2 and 3 only
B. 1 and 3 only
C. All four steps
D. 1, 2, and 4 only
Solution
Step 1: Evaluate each step's impact
Fixed seeds, containerized environments, and logging parameters help reproducibility.
Step 2: Identify problematic step
Using the latest library versions without pinning them means different machines may install different versions, which can change results.
Final Answer:
1, 2, and 4 only -> Option D
Quick Check:
Exclude uncontrolled library versions for reproducibility [OK]
Hint: Control seeds, environment, and logs; avoid uncontrolled versions [OK]
Common Mistakes:
Treating unpinned latest library versions as helpful for reproducibility
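Logging hyperparameters and data versions can be as simple as writing a small run manifest next to each trained model. The sketch below uses only the standard library; the function names data_fingerprint and build_manifest and the manifest fields are assumptions chosen for illustration, not a standard format.

```python
import hashlib
import json
import platform
import sys


def data_fingerprint(data):
    """Hash a JSON-serializable dataset so that any change to it is detectable."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_manifest(data, seed, hyperparams):
    """Collect the facts needed to rerun or audit a training run."""
    return {
        "seed": seed,
        "hyperparams": hyperparams,
        "data_sha256": data_fingerprint(data),
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }


data = [([0, 0], 0), ([1, 1], 1), ([1, 0], 1), ([0, 1], 0)]
manifest = build_manifest(data, seed=42, hyperparams={"threshold": 0.5})
print(json.dumps(manifest, indent=2))
```

Comparing the data_sha256 and python_version fields of two manifests is a quick way to check whether two runs really used the same inputs and interpreter.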