How to Version Data for Machine Learning Projects

To version data for ML, use a data versioning tool such as DVC or Git LFS, which tracks dataset changes alongside your code. Each dataset version is tied to a Git commit, so experiments stay reproducible and datasets are easy to update during model development.

📐

Syntax

Data versioning typically involves commands to add, track, and update datasets in a version control system designed for large files.

For example, with DVC:

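dvc init: Initialize DVC in your project (run it inside an existing Git repository).
dvc add <data-file>: Track a data file or folder. DVC writes a small <data-file>.dvc pointer file and lists the data itself in a .gitignore in the same folder.
git add <data-file>.dvc and git commit -m "message": Commit the pointer file, not the data, to Git.
dvc push: Upload the data to remote storage (a remote must be configured first).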

bash

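# Run inside an existing Git repository (create one with `git init` if needed)
dvc init
dvc add data/train.csv
# `dvc add` creates the pointer file and a .gitignore next to the data
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training data version"
# Requires a configured remote: dvc remote add -d <name> <url>
dvc push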
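To make the idea of a data pointer more concrete: a `.dvc` file records a content hash (an `md5` field) for the data it tracks, so a new dataset version is simply a pointer with a different hash. The sketch below is not DVC's implementation. It only uses Python's standard `hashlib` to show that editing a file changes its content hash; the function name `content_hash` is hypothetical.

```python
import hashlib
import os
import tempfile


def content_hash(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file: editing the data changes the hash.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    with open(csv_path, "w", newline="") as f:
        f.write("feature,label\n1,0\n2,1\n")
    hash_v1 = content_hash(csv_path)

    # Append one row to simulate an updated dataset.
    with open(csv_path, "a", newline="") as f:
        f.write("3,0\n")
    hash_v2 = content_hash(csv_path)

print("v1:", hash_v1)
print("v2:", hash_v2)
print("changed:", hash_v1 != hash_v2)
```

Because only the short hash lives in Git, the repository stays small no matter how large the dataset grows.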

💻

Example

This example shows how to version a CSV dataset using DVC in a simple ML project. It tracks the data file and commits the pointer to Git. The final push to remote storage is left commented out because it requires a configured remote.

python

import os
import subprocess

# Step 1: Initialize DVC in the project folder
# (assumes DVC is installed and the current directory is a Git repository)
subprocess.run(["dvc", "init"], check=True)

# Step 2: Create a sample data file
os.makedirs("data", exist_ok=True)
with open("data/train.csv", "w") as f:
    f.write("feature,label\n1,0\n2,1\n3,0\n")

# Step 3: Add data file to DVC tracking
subprocess.run(["dvc", "add", "data/train.csv"], check=True)

# Step 4: Commit the pointer file to Git
# (`dvc add` writes its .gitignore next to the data, i.e. data/.gitignore)
subprocess.run(["git", "add", "data/train.csv.dvc", "data/.gitignore"], check=True)
subprocess.run(["git", "commit", "-m", "Add training data version"], check=True)

# Step 5: Push data to remote storage (requires remote setup)
# subprocess.run(["dvc", "push"], check=True)

print("Data versioning setup complete.")

Output (final line; DVC and Git print their own messages before it)

Data versioning setup complete.
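Once a dataset has several committed versions, an older one can be restored by checking out the matching pointer file from Git and then asking DVC to sync the data to it. The sketch below only builds and prints the two commands instead of running them, so it is safe to execute anywhere. `restore_commands` is a hypothetical helper, and `v1.0` is a placeholder for a Git tag or commit in your own repository.

```python
def restore_commands(data_path, rev):
    """Build the Git and DVC commands that restore `data_path` as of Git revision `rev`."""
    pointer = data_path + ".dvc"
    return [
        # Bring back the pointer file as it was at the given revision.
        ["git", "checkout", rev, "--", pointer],
        # Make the working copy of the data match that pointer.
        ["dvc", "checkout", pointer],
    ]


commands = restore_commands("data/train.csv", "v1.0")
for cmd in commands:
    print(" ".join(cmd))
```

To run the commands for real, pass each list to `subprocess.run(cmd, check=True)`. If that version of the data is not in the local cache, `dvc pull` is needed to download it from remote storage first.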

⚠️

Common Pitfalls

Not tracking data files: Forgetting to add data files to the versioning tool leads to missing dataset versions.
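Committing large data directly to Git: Git keeps every version of every file in the repository history, so large files bloat the repository and slow down cloning; use tools like DVC or Git LFS instead.
Not setting up remote storage: Without remote storage, data versions exist only in the local cache and cannot be shared or restored on another machine.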
Ignoring data schema changes: Changes in data format or columns should be documented and versioned carefully.
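The last pitfall, schema changes, can be caught mechanically. The following is a minimal sketch, assuming CSV data with a header row: it stores the column names in a small JSON file that could be committed to Git next to the `.dvc` pointer, then reports columns that were added or removed since that snapshot. The helper names `read_columns` and `schema_diff` and the file name `train.schema.json` are hypothetical and not part of DVC.

```python
import csv
import json
import os
import tempfile


def read_columns(csv_path):
    """Return the header row of a CSV file as a list of column names."""
    with open(csv_path, newline="") as f:
        return next(csv.reader(f))


def schema_diff(old_columns, new_columns):
    """Return (added, removed) column names, preserving their order."""
    added = [c for c in new_columns if c not in old_columns]
    removed = [c for c in old_columns if c not in new_columns]
    return added, removed


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "train.csv")
    schema_path = os.path.join(tmp, "train.schema.json")

    # Version 1 of the data: snapshot its columns to a JSON file.
    with open(csv_path, "w", newline="") as f:
        f.write("feature,label\n1,0\n2,1\n")
    with open(schema_path, "w") as f:
        json.dump({"columns": read_columns(csv_path)}, f)

    # Version 2 adds a column; compare it against the stored snapshot.
    with open(csv_path, "w", newline="") as f:
        f.write("feature,feature_2,label\n1,5,0\n2,7,1\n")
    with open(schema_path) as f:
        old_columns = json.load(f)["columns"]
    added, removed = schema_diff(old_columns, read_columns(csv_path))

print("added:", added)      # columns present only in the new version
print("removed:", removed)  # columns present only in the old version
```

A check like this only covers column names. Changes in types, units, or encodings still need to be written down by hand, for example in the commit message.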

bash

# Wrong way: committing large data directly to Git
# git add data/train.csv
# git commit -m "Add large data file"

# Right way: use DVC to track data, then commit only the pointer
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track data with DVC"
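The first two pitfalls can also be checked with a small script before committing. This is a sketch under stated assumptions: it treats any file above a size threshold as large, and it considers a file tracked by DVC only if a sibling `<file>.dvc` pointer exists, which is the layout `dvc add` produces for a single file (data tracked as a whole folder is not handled). The function name `find_unversioned_large_files` and the 10 MB default are illustrative choices, not DVC features.

```python
import os
import tempfile


def find_unversioned_large_files(root, max_bytes=10 * 1024 * 1024):
    """List files under `root` larger than `max_bytes` that have no `<file>.dvc` pointer."""
    flagged = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git and DVC internals.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".dvc"}]
        for name in filenames:
            if name.endswith(".dvc"):
                continue  # pointer files themselves are small and belong in Git
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > max_bytes and not os.path.exists(path + ".dvc"):
                flagged.append(os.path.relpath(path, root))
    return sorted(flagged)


# Demonstration with a tiny threshold so no large files are needed.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data"))
    for name in ("train.csv", "test.csv"):
        with open(os.path.join(tmp, "data", name), "w", newline="") as f:
            f.write("feature,label\n" + "1,0\n" * 50)
    # Simulate `dvc add data/train.csv` by creating its pointer file.
    with open(os.path.join(tmp, "data", "train.csv.dvc"), "w") as f:
        f.write("outs:\n- path: train.csv\n")
    flagged = find_unversioned_large_files(tmp, max_bytes=100)

print(flagged)  # only test.csv is reported, because it has no pointer file
```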

📊

Quick Reference

Command	Purpose
dvc init	Initialize DVC in your project
dvc add <data-file>	Track a data file or folder
git add <data-file>.dvc	Stage the DVC pointer file
git commit -m "msg"	Commit changes to Git
dvc push	Upload data to remote storage
dvc pull	Download data from remote storage

✅

Key Takeaways

Use specialized tools like DVC or Git LFS to version large ML datasets efficiently.

Always track data changes alongside code to ensure reproducibility.

Set up remote storage to share and back up dataset versions.

Avoid committing large data files directly to Git to prevent repository bloat.

Document data schema and format changes as part of versioning.