
Data versioning (DVC) in ML Python

Introduction

Data versioning helps keep track of changes in your data over time. It makes sure you can always go back to an earlier version if needed.

Data versioning is useful:
When you want to save different versions of your training data during model development.
When you work in a team and need to share consistent data versions.
When you want to reproduce past experiments exactly with the same data.
When your dataset is large and you want to avoid copying it multiple times.
When you want to track changes in data alongside your code.
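One way to picture how a tool like DVC tells data versions apart and avoids duplicate copies is content hashing: a file is identified by a hash of its contents, so identical data maps to the same stored entry. The sketch below is a simplified illustration of that idea, not DVC's actual implementation. The names file_md5 and store_version and the cache layout are hypothetical, chosen only for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_version(path: Path, cache_dir: Path) -> str:
    """Copy a file into cache_dir under its hash; skip if already stored."""
    md5 = file_md5(path)
    target = cache_dir / md5[:2] / md5[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    return md5


# Hypothetical demo in a temporary folder.
workdir = Path(tempfile.mkdtemp())
cache = workdir / "cache"
data = workdir / "train.csv"

data.write_text("feature,label\n1,0\n2,1\n")
v1 = store_version(data, cache)
v1_again = store_version(data, cache)  # same content, nothing new is stored

data.write_text("feature,label\n1,0\n2,1\n3,0\n")
v2 = store_version(data, cache)        # changed content, a new entry is stored

stored = [p for p in cache.rglob("*") if p.is_file()]
print(v1 == v1_again, v1 != v2, len(stored))
```

Storing the same content twice leaves one entry in the cache, while a changed file gets its own entry. This is why a hash is enough to name a data version.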
Syntax
ML Python
dvc init

dvc add <data-file-or-folder>

git add <data-file-or-folder>.dvc .gitignore

git commit -m "Add data version"

dvc push

dvc pull

dvc checkout

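dvc init sets up DVC in your project folder. Run it inside an existing Git repository.

dvc add starts tracking a data file or folder. It creates a small .dvc file that records the data's hash and adds the data path to .gitignore, so Git stores the .dvc file instead of the data itself.

dvc push uploads tracked data to remote storage, dvc pull downloads it, and dvc checkout makes the data in your workspace match the current .dvc files.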
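Going back to an earlier data version combines Git and DVC: Git restores the old .dvc file, and DVC then restores the data that file points to. The sketch below assumes the .dvc file was committed at the chosen revision. The function restore_data_version, its from_remote flag, and the path data/train.csv.dvc are hypothetical names for illustration. The runner is injectable so the command sequence can be inspected without Git or DVC installed.

```python
import subprocess


def restore_data_version(rev, dvc_file, from_remote=False, run=subprocess.run):
    """Restore the data that dvc_file pointed to at Git revision rev."""
    # "dvc pull" downloads the data and checks it out; "dvc checkout"
    # only uses what is already in the local DVC cache.
    data_cmd = "pull" if from_remote else "checkout"
    commands = [
        # Bring back the old .dvc file from Git history.
        ["git", "checkout", rev, "--", dvc_file],
        # Make the workspace data match that .dvc file.
        ["dvc", data_cmd, dvc_file],
    ]
    for cmd in commands:
        run(cmd, check=True)
    return commands


# Dry run: record the commands instead of executing them.
recorded = []
restore_data_version(
    "HEAD~1",
    "data/train.csv.dvc",
    run=lambda cmd, check: recorded.append(cmd),
)
for cmd in recorded:
    print(" ".join(cmd))
```

Calling the function without the run argument executes the commands for real, so it should be run from inside a project where Git and DVC are already initialized.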

Examples
Initialize DVC in your project folder to start tracking data.
ML Python
dvc init
Track the training data file train.csv with DVC.
ML Python
dvc add data/train.csv
Save the data tracking info in Git so others can get the same data version.
ML Python
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data"
Upload your data to remote storage for backup and sharing. This requires a remote that was configured with dvc remote add.
ML Python
dvc push
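dvc push needs a remote to upload to. The sketch below shows one possible order of steps for registering a default remote and then pushing. The helper remote_setup_commands, the remote name storage, and the local path /tmp/dvc-storage are assumptions made for illustration. A cloud bucket URL would go in the same place as the path.

```python
import shlex


def remote_setup_commands(name, url):
    """Build the commands that register a default DVC remote and push to it."""
    return [
        # -d marks this remote as the default one for push and pull.
        ["dvc", "remote", "add", "-d", name, url],
        # The remote setting is saved in .dvc/config, which Git should track.
        ["git", "add", ".dvc/config"],
        ["git", "commit", "-m", f"Configure DVC remote {name}"],
        ["dvc", "push"],
    ]


# Print the commands; pass each list to subprocess.run to execute them.
for cmd in remote_setup_commands("storage", "/tmp/dvc-storage"):
    print(shlex.join(cmd))
```

Committing .dvc/config means teammates who clone the repository get the same remote and can run dvc pull without extra setup.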
Sample Model

This example shows how to initialize Git and DVC in a new folder, add a data file, and commit the tracking info with Git. It then lists the files that DVC tracks.

ML Python
import os
import subprocess

# Step 1: Initialize DVC in a new folder
os.makedirs('my_project', exist_ok=True)
os.chdir('my_project')
subprocess.run(['git', 'init'], check=True)
subprocess.run(['dvc', 'init'], check=True)

# Step 2: Create a sample data file
with open('data.csv', 'w') as f:
    f.write('feature,label\n1,0\n2,1\n3,0\n')

# Step 3: Track data file with DVC
subprocess.run(['dvc', 'add', 'data.csv'], check=True)

# Step 4: Add DVC files to git and commit
subprocess.run(['git', 'add', 'data.csv.dvc', '.gitignore'], check=True)
subprocess.run(['git', 'commit', '-m', 'Add data version'], check=True)

# Step 5: Show the files tracked by DVC
# (without --dvc-only, files tracked by Git are listed as well)
tracked_files = subprocess.check_output(['dvc', 'list', '.', '--dvc-only'], text=True)
print('Tracked files in DVC:')
print(tracked_files)
Important Notes

DVC works well with Git to track both code and data versions together.

Configure remote storage (such as a cloud bucket) with dvc remote add so that dvc push can store large data safely outside your Git repository.

Always commit your .dvc files and .gitignore changes to keep data versions consistent.
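The .dvc file is what ties a Git commit to a data version: it is a small text file that records the hash and path of the tracked data. The sketch below reads those two values, assuming the file contains md5: and path: entries as DVC writes for a single tracked file. The sample contents, including the hash, are made up for illustration, and read_pointer is a hypothetical helper.

```python
import re

# Made-up example of what a .dvc file for data.csv might contain.
SAMPLE_DVC_FILE = """\
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 32
  path: data.csv
"""


def read_pointer(dvc_text):
    """Extract the recorded hash and data path from .dvc file text."""
    md5 = re.search(r"^[ \t]*-?[ \t]*md5:[ \t]*(\S+)", dvc_text, re.MULTILINE)
    path = re.search(r"^[ \t]*-?[ \t]*path:[ \t]*(\S+)", dvc_text, re.MULTILINE)
    return (md5.group(1) if md5 else None, path.group(1) if path else None)


md5, path = read_pointer(SAMPLE_DVC_FILE)
print(path, md5)
```

Because the hash changes whenever the data changes, an uncommitted .dvc file means Git has no record of which data version belongs to the current code.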

Summary

Data versioning with DVC helps you track and manage changes in datasets easily.

DVC integrates with Git to keep data and code versions in sync.

It keeps large data files out of Git and stores each distinct version only once.