
Working with large files efficiently in NumPy - Time & Space Complexity

Time Complexity: Working with large files efficiently
O(n)
Understanding Time Complexity

When working with large files in NumPy, it is important to understand how processing time grows as the file size increases.

In other words, we want to know how the required time changes as we read and process more data.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import numpy as np

chunk_size = 100000
results = []

with open('large_file.csv', 'r') as file:
    while True:
        # Read up to chunk_size lines; stop early at end of file.
        lines = []
        for _ in range(chunk_size):
            line = file.readline()
            if not line:
                break
            lines.append(line)
        if not lines:
            break
        # Parse the chunk into a NumPy array and record its column means.
        data = np.genfromtxt(lines, delimiter=',')
        results.append(np.mean(data, axis=0))

This code reads a large CSV file in chunks, converts each chunk to a numpy array, and calculates the mean of each chunk.
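If a single overall mean is needed, the per-chunk means can be combined afterwards, weighted by each chunk's row count. This is a minimal sketch, not part of the snippet above; the `chunk_stats` values are made-up illustrations of (chunk mean, rows in chunk) pairs:

```python
import numpy as np

# Hypothetical per-chunk results: (column means of the chunk, rows in the chunk).
chunk_stats = [
    (np.array([1.0, 2.0]), 100_000),
    (np.array([3.0, 4.0]), 50_000),
]

# Overall mean = sum(mean_i * n_i) / sum(n_i). A plain average of the
# chunk means would be wrong whenever the last chunk is shorter.
total_rows = sum(n for _, n in chunk_stats)
overall = sum(m * n for m, n in chunk_stats) / total_rows
```

Weighting matters because the final chunk usually holds fewer than `chunk_size` lines.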

Identify Repeating Operations

Identify the loops, recursion, or array traversals that repeat.

  • Primary operation: Reading chunks of lines and processing each chunk with numpy.
  • How many times: The loop runs approximately (total lines / chunk_size) times.
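The chunk-count arithmetic can be checked directly; the file sizes below are assumed values for illustration:

```python
import math

chunk_size = 100_000

# Number of chunks = ceil(total_lines / chunk_size); the final chunk may
# be shorter than chunk_size but still costs one loop iteration.
for total_lines in (100_000, 1_000_000, 10_000_000, 10_000_001):
    chunks = math.ceil(total_lines / chunk_size)
    print(total_lines, "lines ->", chunks, "chunks")
```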

How Execution Grows With Input

As the file size grows, the number of chunks increases, so the total processing time grows roughly in direct proportion to the file size.

Input Size (lines) | Approx. Operations (chunks)
100,000            | 1
1,000,000          | 10
10,000,000         | 100

Pattern observation: Doubling the file size roughly doubles the number of chunks and total work.

Final Time Complexity

Time Complexity: O(n)

This means processing time grows linearly with n, the number of lines in the file.

Common Mistake

[X] Wrong: "Reading the file in chunks makes the time complexity constant no matter the file size."

[OK] Correct: Even with chunks, you still read and process every line, so the total time grows with file size.
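One way to see this concretely: chunking bounds how many lines sit in memory at once, but every line is still read and parsed exactly once. A minimal sketch, using an in-memory `io.StringIO` object as a stand-in for the CSV file:

```python
import io
import numpy as np

# Simulated file of 10 CSV lines. Only chunk_size lines are held in
# memory at a time, yet all 10 lines get read and parsed overall.
fake_file = io.StringIO("\n".join(f"{i},{i * 2}" for i in range(10)) + "\n")

chunk_size = 4
lines_processed = 0
while True:
    lines = [ln for ln in (fake_file.readline() for _ in range(chunk_size)) if ln]
    if not lines:
        break
    data = np.genfromtxt(lines, delimiter=",")
    lines_processed += len(lines)

print(lines_processed)  # every line was still processed once
```

Chunking changes peak memory usage, not the total amount of work.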

Interview Connect

Understanding how reading and processing large files scales helps you handle real data efficiently and shows you can think about performance in practical tasks.

Self-Check

"What if we used memory mapping (np.memmap) instead of reading chunks? How would the time complexity change?"
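As a starting point for that question, here is a hedged sketch of `np.memmap`. Note the assumption: `np.memmap` works on raw binary files of a known dtype and shape, not CSV text, so the data would first need converting. The file path and array shape below are illustrative:

```python
import os
import tempfile
import numpy as np

# np.memmap avoids loading the whole array into RAM, but computing a
# mean still visits every element once, so time complexity stays O(n);
# what changes is peak memory usage, not total work.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
rows, cols = 1_000, 3

# Write a binary file to memory-map (illustrative setup).
arr = np.memmap(path, dtype=np.float64, mode="w+", shape=(rows, cols))
arr[:] = np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)
arr.flush()

# Read it back lazily and compute column means.
mm = np.memmap(path, dtype=np.float64, mode="r", shape=(rows, cols))
col_means = mm.mean(axis=0)  # O(n): every element is read
```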