MLOps · DevOps · ~5 mins

Why data versioning is harder than code versioning in MLOps

Introduction
Tracking changes in data is harder than tracking changes in code because data files are often large, binary, and frequently updated. Unlike code, data cannot be diffed line by line, accumulates many versions quickly, and needs dedicated tools to manage efficiently.
When you want to keep track of different versions of datasets used in machine learning experiments
When you need to reproduce a model training exactly with the same data snapshot
When multiple team members update or add data and you want to avoid conflicts or data loss
When you want to audit or compare changes in data over time to understand model performance shifts
When you want to store large datasets efficiently without duplicating entire files for every change
Commands
Initialize a Git repository to track code changes. This works well for small text files like code but not for large data files.
Terminal
git init
Expected Output
Initialized empty Git repository in /home/user/project/.git/
Add a data file to Git tracking. This is possible but inefficient for large or frequently changing data files.
Terminal
git add data.csv
Expected Output
No output (command runs silently)
Commit the data file to the Git repository. Git stores the entire file each time it changes, which can quickly use a lot of space.
Terminal
git commit -m "Add initial data file"
Expected Output
[master (root-commit) abc1234] Add initial data file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 data.csv
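Git's full-copy behavior is easy to see in a throwaway repository. The sketch below (paths and sizes are illustrative) commits a 1 MB file, appends a single byte, commits again, and shows that the repository roughly doubles — Git stored a second full blob, not a diff:

```shell
# Sketch: show Git storing a new full blob for each change to a data file.
# Uses a throwaway repo in a temp directory; exact sizes will vary by machine.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

# 1 MB of incompressible bytes stands in for a real dataset
head -c 1000000 /dev/urandom > data.csv
git add data.csv
git commit -qm "v1"
size_v1=$(du -sk .git | cut -f1)

# Append one byte: Git stores a second full ~1 MB blob for the file
head -c 1 /dev/urandom >> data.csv
git add data.csv
git commit -qm "v2"
size_v2=$(du -sk .git | cut -f1)

echo "repo size after v1: ${size_v1} KB, after v2: ${size_v2} KB"
```

With a 10 GB dataset the same one-byte change would add roughly 10 GB to the repository, which is why Git alone does not scale for data.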
Initialize DVC (Data Version Control) in the project. DVC is designed to handle large data files efficiently by storing metadata and linking to external storage.
Terminal
dvc init
Expected Output
Initialized DVC repository.

You can now commit the changes to git.
Tell DVC to track the data file. DVC creates a small metafile to track the data version without storing the whole file in Git.
Terminal
dvc add data.csv
Expected Output
100% Adding...
To track the changes with git, run:
    git add data.csv.dvc .gitignore
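The metafile itself is a small YAML document that records a content hash instead of the data. Its typical shape looks roughly like this (the hash and size below are placeholders, and exact fields vary by DVC version):

```yaml
# data.csv.dvc — placeholder values; DVC fills in the real hash and size
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e   # content hash of data.csv (placeholder)
  size: 102400                            # file size in bytes (placeholder)
  path: data.csv
```

Because this file is a few lines of text, Git can version it cheaply, while the actual data lives in DVC's cache and remote storage.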
Add the DVC metafile and updated .gitignore to Git. This keeps Git tracking small and efficient.
Terminal
git add data.csv.dvc .gitignore
Expected Output
No output (command runs silently)
Commit the DVC metafile to Git. The actual data is stored separately, making versioning large files practical.
Terminal
git commit -m "Track data.csv with DVC"
Expected Output
[master abc5678] Track data.csv with DVC
 2 files changed, 10 insertions(+), 2 deletions(-)
 create mode 100644 data.csv.dvc
Key Concept

If you remember nothing else from this pattern, remember: data files are large and change differently from code, so special tools like DVC are needed to version them efficiently.

Common Mistakes
Trying to version large data files directly with Git
Git stores full copies of files on each change, causing slow performance and huge repositories.
Use data versioning tools like DVC that track data changes efficiently without storing full copies in Git.
Forgetting to commit the .gitignore updates DVC makes when adding data files
Without the ignore entry committed, the raw data file can still be staged and committed directly, causing repository bloat and slow operations.
Always commit the .gitignore changes alongside the .dvc metafile so Git only ever tracks metadata.
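Concretely, `dvc add data.csv` appends an entry to .gitignore so Git never sees the raw data. The entry looks roughly like this (path relative to the .gitignore file):

```gitignore
# entry written by `dvc add`; keeps the raw data out of Git
/data.csv
```

This is why the earlier step stages `.gitignore` together with `data.csv.dvc` — both are needed for teammates' clones to behave correctly.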
Summary
Git works well for code but is inefficient for large or frequently changing data files.
Data versioning tools like DVC track data changes by storing metadata and linking to external storage.
Combining Git for code and DVC for data keeps repositories small and reproducible.
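Putting it together, a typical loop looks like the sketch below. This is a non-runnable outline: the remote name and path are illustrative, `<commit>` is a placeholder, and the commands assume a project already initialized with `git init` and `dvc init`:

```shell
# Sketch of the combined Git + DVC loop (placeholders marked; not meant to run as-is)
dvc remote add -d mystore /tmp/dvc-store   # register a local directory as the data remote (illustrative path)
dvc push                                   # upload the cached copy of data.csv to the remote
git checkout <commit>                      # check out a past version of code + .dvc metafiles (placeholder id)
dvc checkout                               # restore the data snapshot matching data.csv.dvc
```

Because every Git commit pins an exact data hash via the metafile, checking out old code and running `dvc checkout` reproduces the matching data snapshot.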