DVC (Data Version Control) basics in MLOps - Commands & Configuration

Introduction
When working with machine learning projects, managing data and model versions is hard. DVC helps track data files and models alongside code, making it easy to reproduce experiments and share results.
DVC is a good fit in the following situations:
When you want to keep track of large datasets without storing them directly in Git.
When you need to share data and models with your team while keeping versions organized.
When you want to reproduce machine learning experiments exactly with the same data and code.
When you want to avoid mixing code changes with data changes in your version control.
When you want to automate data pipeline steps and track their outputs.
Commands
This command initializes DVC in your project folder. Run it inside an existing Git repository. It creates the .dvc directory and the config files needed to start tracking data and models.
Terminal
dvc init
Expected Output (trimmed; exact wording varies by DVC version)
Initialized DVC repository.

You can now commit the changes to git.
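To see what initialization produces, the following sketch runs `dvc init` in a throwaway Git repository and lists the results. The temporary directory and the `demo` variable are illustrative only. The sketch assumes `git` and `dvc` are on the PATH and skips itself when `dvc` is missing.

```shell
#!/bin/sh
set -eu

# Skip when DVC is not installed (assumption: the CLI is called `dvc`).
command -v dvc >/dev/null 2>&1 || { echo "dvc not found, skipping demo" >&2; exit 0; }

# Work in a scratch directory so no real project is touched.
demo=$(mktemp -d)
cd "$demo"

# DVC expects a Git repository here unless --no-scm is passed to `dvc init`.
git init -q
dvc init -q

# DVC's internal directory and its ignore file now exist.
ls -a .dvc
cat .dvcignore

# Show what Git sees; the new DVC files still need a normal Git commit.
git status --short
```

Committing these files right away keeps the DVC setup separate from later data changes in the Git history.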
This command tells DVC to track the dataset.csv file. It creates a small pointer file (data/dataset.csv.dvc), stores the actual data in the DVC cache, and adds the data file to a .gitignore so Git leaves it alone.
Terminal
dvc add data/dataset.csv
Expected Output (illustrative; exact wording varies by DVC version)
Adding 'data/dataset.csv' to DVC.
Computing md5 hash: 123abc456def7890
Saving to cache: .dvc/cache/12/3abc456def7890
To track this file, commit the changes to Git.
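The cache path in the output above comes from the file's content hash: the first two characters name a directory and the remaining characters name the file. The sketch below mimics that split with plain shell tools and does not call DVC. It assumes GNU `md5sum` is available (macOS ships `md5` instead), and the sample CSV content is made up. Real DVC releases may nest the path under additional directories, and some versions normalize line endings before hashing, so treat the printed path as an approximation.

```shell
#!/bin/sh
set -eu

# Create a small stand-in for data/dataset.csv in a scratch directory.
work=$(mktemp -d)
mkdir -p "$work/data"
printf 'id,value\n1,10\n2,20\n' > "$work/data/dataset.csv"

# Hash the file content; md5sum prints "<hash>  <path>", so keep field 1.
hash=$(md5sum "$work/data/dataset.csv" | cut -d ' ' -f 1)

# Split the hash: two characters for the directory, the rest for the file name.
prefix=$(printf '%s' "$hash" | cut -c 1-2)
rest=$(printf '%s' "$hash" | cut -c 3-)

echo "md5:        $hash"
echo "cache path: .dvc/cache/$prefix/$rest"
```

Because the path depends only on content, two identical files share one cache entry, and any change to the data yields a new entry instead of overwriting the old one.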
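Stage the DVC pointer file and the .gitignore that DVC updated. DVC writes the ignore rule next to the tracked file, so here it is data/.gitignore rather than the root .gitignore. This records the data version in Git without storing the actual data there.
Terminal
git add data/dataset.csv.dvc data/.gitignore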
Expected Output
No output (command runs silently)
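Since `git add` prints nothing, it can help to confirm what Git will and will not store. The sketch below rebuilds the steps so far in a scratch repository, prints the pointer file, and checks that the data file itself is ignored. The sample CSV content and the `demo` variable are made up for illustration; DVC writes the ignore rule to a `.gitignore` next to the tracked file, so in this sketch it is `data/.gitignore`.

```shell
#!/bin/sh
set -eu

# Skip when DVC is not installed (assumption: the CLI is called `dvc`).
command -v dvc >/dev/null 2>&1 || { echo "dvc not found, skipping demo" >&2; exit 0; }

demo=$(mktemp -d)
cd "$demo"
git init -q
dvc init -q

# A tiny stand-in dataset.
mkdir data
printf 'id,value\n1,10\n2,20\n' > data/dataset.csv
dvc add -q data/dataset.csv

# The pointer file is small YAML holding the content hash, size, and path.
cat data/dataset.csv.dvc

# DVC added an ignore rule next to the data file.
cat data/.gitignore

# Stage only the pointer and the ignore rule, then inspect the index.
git add data/dataset.csv.dvc data/.gitignore
git status --short

# Confirms that Git ignores the data file and shows which rule matches.
git check-ignore -v data/dataset.csv
```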
Commit the changes to Git so the data version is linked with your code version.
Terminal
git commit -m "Track dataset.csv with DVC"
Expected Output
[main abc1234] Track dataset.csv with DVC
 2 files changed, 10 insertions(+)
 create mode 100644 data/dataset.csv.dvc
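Upload the actual data files tracked by DVC to remote storage (such as a cloud bucket or a shared server) so others can access them. A remote must already be configured with 'dvc remote add'; without one, 'dvc push' has nowhere to upload to and fails.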
Terminal
dvc push
Expected Output
Uploading data/dataset.csv to remote storage.
100%|███████████████████████████████████████| 1.00M/1.00M [00:01<00:00, 1.00MB/s]
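`dvc push` only works once a remote is configured. As a minimal sketch, the commands below register a local directory as the default remote and push to it. The remote name `localremote` and the temporary `storage` directory are hypothetical stand-ins; a real project would point the remote at cloud or shared storage instead.

```shell
#!/bin/sh
set -eu

# Skip when DVC is not installed (assumption: the CLI is called `dvc`).
command -v dvc >/dev/null 2>&1 || { echo "dvc not found, skipping demo" >&2; exit 0; }

demo=$(mktemp -d)     # scratch project
storage=$(mktemp -d)  # stands in for remote storage
cd "$demo"
git init -q
dvc init -q

# Track a tiny stand-in dataset.
mkdir data
printf 'id,value\n1,10\n2,20\n' > data/dataset.csv
dvc add -q data/dataset.csv

# Register the remote; -d makes it the default target for push and pull.
dvc remote add -d localremote "$storage"

# Upload the cached data to the remote.
dvc push

# Compare the local cache with the remote to confirm nothing is left to upload.
dvc status --cloud
```

The remote definition is written to `.dvc/config`, so committing that file to Git shares the remote location with the rest of the team.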
Key Concept

If you remember nothing else, remember: DVC tracks large data files separately from code, linking them with small pointer files in Git for easy versioning and sharing.
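To make the pointer-file idea concrete, the sketch below commits two versions of a dataset and then restores the first one by checking out the older pointer file and letting DVC sync the workspace. The file contents, commit messages, and the demo Git identity are all made up for illustration.

```shell
#!/bin/sh
set -eu

# Skip when DVC is not installed (assumption: the CLI is called `dvc`).
command -v dvc >/dev/null 2>&1 || { echo "dvc not found, skipping demo" >&2; exit 0; }

demo=$(mktemp -d)
cd "$demo"
git init -q
# Repository-local identity so commits work on a machine without Git config.
git config user.name "Demo User"
git config user.email "demo@example.com"

dvc init -q
git add .dvc .dvcignore
git commit -q -m "Initialize DVC"

# Version 1: header plus one row.
mkdir data
printf 'id,value\n1,10\n' > data/dataset.csv
dvc add -q data/dataset.csv
git add data/dataset.csv.dvc data/.gitignore
git commit -q -m "Dataset v1"

# Version 2: one more row; only the pointer file changes in Git.
printf 'id,value\n1,10\n2,20\n' > data/dataset.csv
dvc add -q data/dataset.csv
git add data/dataset.csv.dvc
git commit -q -m "Dataset v2"

# Restore the v1 pointer from the previous commit, then sync the data to it.
git checkout -q HEAD~1 -- data/dataset.csv.dvc
dvc checkout -q

# The workspace file is v1 again: the header and one row.
wc -l < data/dataset.csv
```

Git only ever stored the two small pointer files; both copies of the data live in the DVC cache, which is why switching versions does not require re-downloading or regenerating anything in this sketch.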

Common Mistakes
Mistake: Adding large data files directly to Git instead of using 'dvc add'.
Why it hurts: Git is not designed for large files; they bloat the repository and slow down clones and fetches.
Fix: Use 'dvc add' to track large files and commit only the small .dvc pointer files to Git.

Mistake: Forgetting to run 'dvc push' after adding data files.
Why it hurts: The data files stay in the local cache only, so teammates cannot pull them from remote storage.
Fix: Always run 'dvc push' to upload data files to remote storage after adding or updating them.

Mistake: Not committing the .dvc files to Git after running 'dvc add'.
Why it hurts: Without the committed .dvc files, Git has no record of which data version belongs to which commit, so the experiment cannot be reproduced.
Fix: Add and commit the .dvc files and .gitignore changes to Git after 'dvc add'.
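For the first mistake, a quick check can reveal large files that were staged in Git by accident. The sketch below uses only Git and standard shell tools: it builds a scratch repository, lists tracked files above a size limit, and removes the offender from Git's index while leaving it on disk. The 1 MiB limit, the file names, and the zero-filled dummy dataset are arbitrary choices for illustration.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)
cd "$demo"
git init -q

# A 2 MiB dummy dataset and a small script, both staged in Git by mistake.
mkdir data
dd if=/dev/zero of=data/dataset.csv bs=1024 count=2048 2>/dev/null
echo 'print("train")' > train.py
git add data/dataset.csv train.py

# List tracked files larger than the limit (in bytes).
limit=1048576
large=$(git ls-files | while IFS= read -r f; do
  size=$(wc -c < "$f" | tr -d ' ')
  if [ "$size" -gt "$limit" ]; then
    echo "$f"
  fi
done)
echo "Large files tracked by Git: $large"

# Remove the file from Git's index only; the file itself stays on disk.
git rm -q --cached data/dataset.csv

# At this point `dvc add data/dataset.csv` would take over tracking.
git ls-files
```

If the large file was already committed, it remains in the Git history even after this step; the sketch only covers the case where it was staged but not yet committed.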
Summary
Initialize DVC in your project with 'dvc init' to start tracking data.
Use 'dvc add' to track large data files and create pointer files.
Commit the .dvc pointer files and .gitignore changes to Git to link data versions with code.
Run 'dvc push' to upload data files to remote storage for sharing and backup.