Data versioning helps keep track of changes in your data over time. It makes sure you can always go back to an earlier version if needed.
Data versioning (DVC) in ML Python
Introduction
DVC is useful in these situations:
When you want to save different versions of your training data during model development.
When you work in a team and need to share consistent data versions.
When you want to reproduce past experiments exactly with the same data.
When your dataset is large and you want to avoid copying it multiple times.
When you want to track changes in data alongside your code.
Syntax
dvc init
dvc add <data-file-or-folder>
git add <data-file-or-folder>.dvc .gitignore
git commit -m "Add data version"
dvc push
dvc pull
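dvc checkout
dvc init sets up DVC in your project folder.
dvc add tracks your data files or folders and writes a small .dvc file for each one.
dvc push uploads the tracked data to remote storage, and dvc pull downloads it.
dvc checkout updates the data files in your workspace to match the .dvc files in the current Git commit.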
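To make the idea behind dvc add more concrete, here is a minimal Python sketch of content-based tracking. It is not DVC's implementation: the helper names (file_md5, store) and the flat cache folder are hypothetical, and DVC's real cache layout differs between versions. It only illustrates that a version can be identified by a hash of the file content (DVC records an MD5 hash in the .dvc file), so identical content is stored once.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(path, cache_dir):
    """Copy a file into a hash-keyed cache unless that content is already there.

    Hypothetical helper: a simplified stand-in, not DVC's real cache layout.
    """
    key = file_md5(path)
    target = cache_dir / key
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    return key


def demo():
    """Store two versions of a file and count the cached copies."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        cache = root / "cache"

        data.write_text("feature,label\n1,0\n2,1\n")
        v1 = store(data, cache)
        v1_again = store(data, cache)  # same content, same key, nothing new stored

        data.write_text("feature,label\n1,0\n2,1\n3,0\n")
        v2 = store(data, cache)  # changed content gets a new key

        cached = len(list(cache.iterdir()))
    return v1, v1_again, v2, cached


v1, v1_again, v2, cached = demo()
print(v1 == v1_again, v1 != v2, cached)  # prints: True True 2
```

The small text key plays the role of the .dvc file: it is cheap to commit to Git, while the large data stays in the cache or in remote storage.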
Examples
Initialize DVC in your project folder to start tracking data.
dvc init
Track the training data file train.csv with DVC.
dvc add data/train.csv
Save the data tracking info in Git so others can get the same data version.
git add data/train.csv.dvc .gitignore
git commit -m "Track training data"
Upload your data version to remote storage for backup and sharing.
dvc push
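The examples above save a data version but do not show how to go back to one. The sketch below builds the usual command sequence for that: restore the .dvc file from an earlier Git revision, then let DVC bring the data in line with it. The function name restore_commands and the revision HEAD~1 are hypothetical, chosen only for illustration. The commands are printed rather than executed, so the snippet runs without a DVC project.

```python
import shlex


def restore_commands(revision, dvc_file, from_remote=False):
    """Build the commands that bring back the data recorded at a Git revision.

    `revision` and `dvc_file` are placeholders supplied by the caller.
    """
    # 1. Restore the small .dvc file as it was at that revision.
    commands = [["git", "checkout", revision, "--", dvc_file]]
    # 2. Make the workspace data match the restored .dvc file.
    #    dvc pull downloads from remote storage first;
    #    dvc checkout only uses the local cache.
    dvc_step = "pull" if from_remote else "checkout"
    commands.append(["dvc", dvc_step, dvc_file])
    return commands


for cmd in restore_commands("HEAD~1", "data/train.csv.dvc"):
    print(shlex.join(cmd))
```

To actually run the sequence inside a project, each command list could be passed to subprocess.run(cmd, check=True).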
Sample Model
This example shows how to initialize DVC, add a data file, and commit the tracking info with Git. It then lists the tracked files.
import os
import subprocess

# Step 1: Initialize Git and DVC in a new folder
os.makedirs('my_project', exist_ok=True)
os.chdir('my_project')
subprocess.run(['git', 'init'], check=True)
subprocess.run(['dvc', 'init'], check=True)

# Step 2: Create a sample data file
with open('data.csv', 'w') as f:
    f.write('feature,label\n1,0\n2,1\n3,0\n')

# Step 3: Track data file with DVC
subprocess.run(['dvc', 'add', 'data.csv'], check=True)

# Step 4: Add DVC files to git and commit
subprocess.run(['git', 'add', 'data.csv.dvc', '.gitignore'], check=True)
subprocess.run(['git', 'commit', '-m', 'Add data version'], check=True)

# Step 5: Show tracked files (--dvc-only leaves out files tracked only by Git)
tracked_files = subprocess.check_output(['dvc', 'list', '.', '--dvc-only']).decode()
print('Tracked files in DVC:')
print(tracked_files)
Important Notes
DVC works well with Git to track both code and data versions together.
Use remote storage (like cloud buckets) with dvc remote add to save large data safely.
Always commit your .dvc files and .gitignore changes to keep data versions consistent.
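As a rough illustration of the remote storage note, the sketch below builds the commands for registering a default remote and uploading data to it. The remote name storage and the bucket URL s3://my-bucket/dvc-store are made-up placeholders, and remote_setup_commands is a hypothetical helper. The commands are only printed, so nothing is changed on your machine.

```python
import shlex


def remote_setup_commands(name, url):
    """Build the commands that register a default DVC remote and upload data.

    `name` and `url` are hypothetical values; replace them with your own.
    """
    return [
        # -d marks this remote as the default target for dvc push and dvc pull.
        ["dvc", "remote", "add", "-d", name, url],
        # The remote is recorded in .dvc/config, so commit it for teammates.
        ["git", "add", ".dvc/config"],
        ["git", "commit", "-m", "Configure DVC remote"],
        # Upload the data tracked by DVC to the remote.
        ["dvc", "push"],
    ]


for cmd in remote_setup_commands("storage", "s3://my-bucket/dvc-store"):
    print(shlex.join(cmd))
```

Committing .dvc/config alongside the .dvc files means a teammate who clones the repository can run dvc pull without configuring the remote again.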
Summary
Data versioning with DVC helps you track and manage changes in datasets easily.
DVC integrates with Git to keep data and code versions in sync.
It supports large data files without copying them multiple times.