Challenge - 5 Problems
DVC Pipeline Master
💻 Command Output
Difficulty: intermediate
DVC Pipeline Stage Output
You run the command 'dvc run -n preprocess -d data/raw.csv -o data/processed.csv python preprocess.py'. What will be the output of 'dvc pipeline show' immediately after?
💡 Hint
Think about what stages are created after running dvc run.
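Explanation: The command creates a single stage named 'preprocess'. 'dvc pipeline show' lists the stages in the pipeline, and since 'preprocess' is the only one, the output contains just that stage. Newer DVC releases replace 'dvc run' with 'dvc stage add' and 'dvc pipeline show' with 'dvc dag'.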
❓ Configuration
Difficulty: intermediate
Correct DVC Stage Definition in dvc.yaml
Which of the following dvc.yaml stage definitions correctly defines a stage named 'train' that depends on 'data/processed.csv', runs 'python train.py', and produces 'model.pkl'?
💡 Hint
Remember dependencies are inputs, outputs are results.
Explanation: Option A is correct. It lists 'data/processed.csv' under 'deps', 'model.pkl' under 'outs', and 'python train.py' as 'cmd'.
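The answer options did not survive on this page, so here is a minimal sketch of what a correct 'train' stage could look like. It builds the dvc.yaml text from the values in the question. The helper name render_stage is hypothetical, and the renderer is hand-rolled for illustration only; it is not part of DVC.

```python
def render_stage(name, cmd, deps, outs):
    """Render a single stage as dvc.yaml text (illustrative, no YAML library)."""
    lines = ["stages:", f"  {name}:", f"    cmd: {cmd}", "    deps:"]
    lines += [f"      - {path}" for path in deps]   # inputs the stage reads
    lines.append("    outs:")
    lines += [f"      - {path}" for path in outs]   # results the stage writes
    return "\n".join(lines) + "\n"


# Values taken from the question; render_stage itself is a hypothetical helper.
dvc_yaml = render_stage(
    name="train",
    cmd="python train.py",
    deps=["data/processed.csv"],
    outs=["model.pkl"],
)
print(dvc_yaml)
```

The key point is the direction of each entry: files the command reads go under 'deps', files it writes go under 'outs'.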
❓ Troubleshoot
Difficulty: advanced
DVC Pipeline Reproduction Issue
You modified 'preprocess.py' but running 'dvc repro' does not rerun the 'preprocess' stage. What is the most likely cause?
💡 Hint
DVC tracks changes only in declared dependencies.
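Explanation: DVC decides whether to rerun a stage by checking the files declared for it. If 'preprocess.py' is not declared as a dependency (for example with '-d preprocess.py' or under 'deps' in dvc.yaml), DVC does not detect the change and skips the stage.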
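The behaviour described above can be sketched with a toy model: record a hash for each declared dependency, then compare later. This is a simplified illustration, not DVC's actual implementation; the function names snapshot and is_stale are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Hash a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def snapshot(deps):
    """Record the current hash of every declared dependency."""
    return {str(p): file_md5(p) for p in deps}


def is_stale(deps, recorded):
    """A stage is stale when any declared dependency's hash has changed."""
    return snapshot(deps) != recorded


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    script = Path(tmp) / "preprocess.py"
    raw.write_text("a,b\n1,2\n")
    script.write_text("print('v1')\n")

    # Case 1: only the data file is declared. Case 2: the script is declared too.
    declared_only_data = [raw]
    declared_data_and_script = [raw, script]
    lock_a = snapshot(declared_only_data)
    lock_b = snapshot(declared_data_and_script)

    # Edit the script, as in the question.
    script.write_text("print('v2')\n")

    stale_without_script = is_stale(declared_only_data, lock_a)
    stale_with_script = is_stale(declared_data_and_script, lock_b)

print("rerun when script undeclared:", stale_without_script)
print("rerun when script declared:  ", stale_with_script)
```

In the first case the edit is invisible to the check, which mirrors why the stage in the question is skipped.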
🔀 Workflow
Difficulty: advanced
DVC Pipeline Stage Execution Order
Given a pipeline with stages 'download' -> 'preprocess' -> 'train', which command reproduces the 'train' stage together with every stage it depends on?
💡 Hint
Reproducing a stage also reproduces its dependencies.
Explanation: Running 'dvc repro train' reproduces 'train' and all stages it depends on, here 'download' and 'preprocess', in dependency order. Stages whose dependencies have not changed are skipped rather than re-executed.
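The ordering in this explanation can be sketched as a depth-first walk over upstream stages. This is a conceptual model of the dependency resolution, not DVC's code; repro_order and the upstream mapping are hypothetical names built from the stage names in the question.

```python
def repro_order(target, upstream):
    """Return the target and all stages it depends on, upstream stages first."""
    order = []
    visited = set()

    def visit(stage):
        if stage in visited:
            return
        visited.add(stage)
        for parent in upstream.get(stage, []):
            visit(parent)       # resolve everything the stage needs first
        order.append(stage)     # then the stage itself

    visit(target)
    return order


# Each stage maps to the stages it directly depends on.
upstream = {
    "download": [],
    "preprocess": ["download"],
    "train": ["preprocess"],
}

order_for_train = repro_order("train", upstream)
order_for_preprocess = repro_order("preprocess", upstream)
print(order_for_train)
print(order_for_preprocess)
```

Targeting 'preprocess' instead leaves 'train' out of the walk, since only upstream stages are collected.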
✅ Best Practice
Difficulty: expert
Best Practice for Large Data Files in DVC Pipelines
You have very large raw data files that rarely change but are needed for multiple pipeline runs. What is the best practice to manage these files with DVC?
💡 Hint
Think about versioning large files efficiently without bloating Git.
Explanation: DVC tracks large files by keeping the data itself in its cache and in remote storage, while Git holds only lightweight pointer files. Tracking the raw data with 'dvc add' and pushing it with 'dvc push' versions the files without bloating the Git repository.
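The pointer idea can be illustrated with a small content-addressed store. This is a simplified sketch under stated assumptions: the track function, the cache layout, and the pointer text are illustrative and do not reproduce DVC's exact on-disk format.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def track(data_file, cache_dir):
    """Copy data_file into a hash-addressed cache and return a small pointer."""
    md5 = hashlib.md5()
    with open(data_file, "rb") as fh:
        # Hash in chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()

    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():          # identical content is stored only once
        shutil.copy2(data_file, cached)

    # Illustrative pointer text; this small file is what would go into Git.
    pointer = f"outs:\n- md5: {digest}\n  path: {data_file.name}\n"
    return pointer, cached


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "raw.bin"
    data.write_bytes(b"\0" * 1_000_000)   # stand-in for a large raw file

    pointer, cached = track(data, Path(tmp) / "cache")
    data_bytes = data.stat().st_size
    pointer_bytes = len(pointer.encode())
    cached_ok = cached.exists()

print(pointer)
print(f"data: {data_bytes} bytes, pointer: {pointer_bytes} bytes")
```

Because the data rarely changes, the cached copy is written once and later pipeline runs only need the pointer to locate it.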