Chunked reading for large files in Pandas - Time & Space Complexity
When reading very large files, we often use chunked reading to handle the data in parts.
We want to know how the time to read and process the file grows as the file size grows.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    processed = chunk[chunk['value'] > 0]  # keep only rows where 'value' is positive
    chunks.append(processed)

result = pd.concat(chunks)  # combine the filtered chunks into one DataFrame
```
This code reads a large CSV file in chunks, filters rows in each chunk, and combines the results.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Looping over chunks of the file and filtering rows in each chunk.
- How many times: Number of chunks equals total rows divided by chunk size.
As the file size grows, the number of chunks grows proportionally.
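As a quick sanity check, here is a minimal sketch (the row counts are the same illustrative sizes used in the table that follows) computing how many chunks, and therefore loop iterations, each file size requires:

```python
import math

# Illustrative sketch: the number of chunks (loop iterations) grows
# linearly with the total number of rows.
chunk_size = 10000
for total_rows in (10_000, 100_000, 1_000_000):
    num_chunks = math.ceil(total_rows / chunk_size)
    print(f"{total_rows:>9} rows -> {num_chunks} chunk(s)")
```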
| Input Size (rows) | Approx. Operations |
|---|---|
| 10,000 | 1 chunk to read and filter |
| 100,000 | 10 chunks to read and filter |
| 1,000,000 | 100 chunks to read and filter |
Pattern observation: Operations grow linearly with the number of rows in the file.
Time Complexity: O(n), where n is the number of rows.
This means the time to read and process the file grows in direct proportion to its size.
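For intuition, here is a rough timing sketch (the file names and row counts are made up for illustration, and exact timings will vary by machine); the elapsed time should grow roughly in proportion to the row count:

```python
import time
import numpy as np
import pandas as pd

chunk_size = 10_000
for total_rows in (100_000, 200_000, 400_000):
    path = f'timing_{total_rows}.csv'  # hypothetical scratch file
    pd.DataFrame({'value': np.random.randn(total_rows)}).to_csv(path, index=False)

    start = time.perf_counter()
    parts = [c[c['value'] > 0] for c in pd.read_csv(path, chunksize=chunk_size)]
    result = pd.concat(parts, ignore_index=True)
    elapsed = time.perf_counter() - start

    print(f"{total_rows:>7} rows: {elapsed:.3f} s, {len(result)} rows kept")
```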
[X] Wrong: "Reading in chunks makes the process faster than reading the whole file at once."
[OK] Correct: Chunking helps manage memory but does not reduce total work; total time still grows with file size.
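A minimal sketch of that point, assuming a small synthetic CSV written locally (the file name and sizes are illustrative): both approaches filter every row exactly once, so the total work is the same; chunking only limits how many raw rows sit in memory at a time.

```python
import numpy as np
import pandas as pd

# Build a small synthetic CSV so the comparison is self-contained (illustrative only).
path = 'example_large_file.csv'
pd.DataFrame({'value': np.random.randn(50_000)}).to_csv(path, index=False)

# Whole-file approach: every row is loaded at once, then filtered once.
whole = pd.read_csv(path)
whole_result = whole[whole['value'] > 0]

# Chunked approach: every row is still read and filtered exactly once,
# but at most chunk_size raw rows are held in the working chunk at a time.
chunk_size = 10_000
parts = [chunk[chunk['value'] > 0] for chunk in pd.read_csv(path, chunksize=chunk_size)]
chunked_result = pd.concat(parts, ignore_index=True)

# The same rows survive the filter either way: total work is O(n) in both cases.
print(len(whole_result) == len(chunked_result))  # True
```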
Understanding how chunked reading scales helps you handle big data efficiently and shows you can think about performance practically.
"What if we increased the chunk size to read more rows at once? How would the time complexity change?"