Why data versioning is harder than code versioning in MLOps - Performance Analysis
We want to understand why managing versions of data takes more effort than managing versions of code. In particular: how does the work required grow as the data size increases, compared to code?
Analyze the time complexity of the following data versioning process.
```python
def save_data_version(data):
    for record in data:
        store_record(record)   # persist one record of the new version
    update_metadata(data.id)   # single metadata update for the whole version
```
This function stores every record of the data set as part of the new version, then updates the version metadata once. To analyze it, look for the repeated operations that dominate the running time.
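The snippet above is not runnable on its own, since `store_record` and `update_metadata` are left undefined. Below is a minimal self-contained sketch: the helpers and the `Dataset` wrapper are hypothetical stand-ins that simply count calls, so we can observe how much work `save_data_version` does.

```python
# Hypothetical stand-ins: count calls instead of actually persisting data.
store_calls = 0
metadata_calls = 0

def store_record(record):
    """Stand-in for persisting one record (just counts the call)."""
    global store_calls
    store_calls += 1

def update_metadata(dataset_id):
    """Stand-in for updating version metadata (just counts the call)."""
    global metadata_calls
    metadata_calls += 1

class Dataset:
    """Hypothetical data-set wrapper with an id and iterable records."""
    def __init__(self, dataset_id, records):
        self.id = dataset_id
        self.records = records
    def __iter__(self):
        return iter(self.records)

def save_data_version(data):
    for record in data:       # one store operation per record -> O(n)
        store_record(record)
    update_metadata(data.id)  # one constant-time metadata update

save_data_version(Dataset("ds-1", list(range(100))))
print(store_calls, metadata_calls)  # 100 store calls, 1 metadata update
```

Running this with 100 records shows 100 store operations and exactly 1 metadata update, matching the pattern in the table below.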
- Primary operation: Looping over each data record to store it.
- How many times: Once for every record in the data set.
As data size grows, the work grows too because each record is handled separately.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 store operations + 1 metadata update |
| 100 | 100 store operations + 1 metadata update |
| 1000 | 1000 store operations + 1 metadata update |
Pattern observation: The number of operations grows directly with data size.
Time Complexity: O(n)
This means the running time grows linearly with the data size: doubling the number of records doubles the work.
[X] Wrong: "Data versioning is as simple and fast as code versioning because both just save changes."
[OK] Correct: Data sets are usually far larger than code bases and must be stored record by record, so versioning data is slower and more complex than saving code files.
Understanding how data size affects versioning effort shows you can think about real-world challenges in managing machine learning projects.
"What if we only saved changes (deltas) instead of full data copies? How would the time complexity change?"