Working with large files efficiently in NumPy - Time & Space Complexity
When working with large files in NumPy, it is important to understand how processing time grows as the file size increases. In other words: how does the time needed change when we read and process more data?
Analyze the time complexity of the following code snippet.
```python
import numpy as np

chunk_size = 100000
results = []
with open('large_file.csv', 'r') as file:
    while True:
        # Read up to chunk_size lines from the file.
        lines = []
        for _ in range(chunk_size):
            line = file.readline()
            if not line:  # reached end of file
                break
            lines.append(line)
        if not lines:  # nothing left to process
            break
        # Parse the chunk into a 2-D array and store its column means.
        data = np.genfromtxt(lines, delimiter=',')
        results.append(np.mean(data, axis=0))
```
This code reads a large CSV file in chunks, converts each chunk to a NumPy array, and calculates the per-column mean of each chunk.
Identify the repeated work: the loops, recursion, and array traversals.
- Primary operation: Reading chunks of lines and processing each chunk with numpy.
- How many times: The outer loop runs approximately (total lines / chunk_size) times, and each iteration reads and parses up to chunk_size lines.
As the file size grows, the number of chunks increases proportionally. Since each chunk takes roughly the same time to read and parse, the total work (chunks × work per chunk ≈ total lines) grows in direct proportion to the file size.
| Input Size (lines) | Approx. Operations (chunks) |
|---|---|
| 100,000 | 1 |
| 1,000,000 | 10 |
| 10,000,000 | 100 |
Pattern observation: Doubling the file size roughly doubles the number of chunks and total work.
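The chunk counts in the table can be verified with a few lines of arithmetic (a quick sketch; the ceiling accounts for a possible shorter final chunk):

```python
import math

chunk_size = 100000
for total_lines in (100_000, 1_000_000, 10_000_000):
    # Number of outer-loop iterations needed to consume the whole file.
    chunks = math.ceil(total_lines / chunk_size)
    print(f"{total_lines:>10} lines -> {chunks:>3} chunks")
```

Each tenfold increase in lines gives a tenfold increase in chunks, which is exactly the linear pattern in the table.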
Time Complexity: O(n)
This means the time to process grows linearly with the number of lines in the file.
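To see the chunked pattern end to end without an actual large file, here is a runnable sketch that feeds the same loop from an in-memory string (`io.StringIO` stands in for the CSV file; the weighted combination at the end is an addition not in the original snippet, needed because the last chunk may be shorter than `chunk_size`):

```python
import io
import numpy as np

# Synthetic stand-in for 'large_file.csv': 10 rows, 2 columns (i, 2*i).
csv_text = "\n".join(f"{i},{i * 2}" for i in range(10))

chunk_size = 4  # small chunk so the loop runs a few times
results = []
with io.StringIO(csv_text) as file:
    while True:
        lines = []
        for _ in range(chunk_size):
            line = file.readline()
            if not line:
                break
            lines.append(line)
        if not lines:
            break
        # atleast_2d guards against a single-line chunk parsing as 1-D.
        data = np.atleast_2d(np.genfromtxt(lines, delimiter=','))
        results.append((np.mean(data, axis=0), len(data)))

# Combine per-chunk means into one overall mean, weighting by chunk length.
means = np.array([m for m, _ in results])
weights = np.array([n for _, n in results], dtype=float)
overall = (means * weights[:, None]).sum(axis=0) / weights.sum()
print(overall)  # -> [4.5 9. ], the true column means over all 10 rows
```

Every line is still read and parsed exactly once, so the total work remains proportional to the number of lines.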
[X] Wrong: "Reading the file in chunks makes the time complexity constant no matter the file size."
[OK] Correct: Even with chunks, you still read and process every line, so the total time grows with file size.
Understanding how reading and processing large files scales helps you handle real data efficiently and shows you can think about performance in practical tasks.
"What if we used memory mapping (np.memmap) instead of reading chunks? How would the time complexity change?"