Why data I/O matters in Pandas - Performance Analysis
Reading and writing data can dominate runtime when working with pandas.
We want to know how the time to load or save data grows as the dataset gets bigger.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.read_csv('large_file.csv')
df.to_csv('output_file.csv', index=False)
```
This code reads a CSV file into a DataFrame and then writes it back to a new CSV file.
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Reading and writing each row of the file.
- How many times: Once for each row in the file (n times).
As the number of rows grows, the time to read and the time to write both grow in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 reads + 10 writes |
| 100 | 100 reads + 100 writes |
| 1000 | 1000 reads + 1000 writes |
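The pattern in the table can be checked with a small sketch. This uses Python's standard-library csv module (rather than pandas) purely to make the per-row work countable; the column values are made up for illustration.

```python
import csv
import io

def count_row_operations(n: int) -> int:
    """Write n rows to an in-memory CSV, then count the rows read back."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i in range(n):                    # n write operations
        writer.writerow([i, i * 2])
    buf.seek(0)
    reads = sum(1 for _ in csv.reader(buf))  # n read operations
    return reads

print([count_row_operations(n) for n in (10, 100, 1000)])  # [10, 100, 1000]
```

Each input size produces exactly n reads and n writes, matching the table: the work grows linearly with the number of rows.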
Pattern observation: The time grows directly with the number of rows; doubling rows doubles the work.
Time Complexity: O(n)
This means the time to read or write data grows linearly with the number of rows.
[X] Wrong: "Reading a file is instant no matter how big it is."
[OK] Correct: The computer must process each row, so bigger files take more time.
Understanding how data input and output time grows helps you write better code and explain performance clearly.
"What if we read the file in chunks instead of all at once? How would the time complexity change?"
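One way to explore that question is pandas' own chunked reader, `pd.read_csv(..., chunksize=...)`, which yields the file one DataFrame at a time. The sketch below builds a small throwaway CSV first (the file path and sizes are made up for the example):

```python
import os
import tempfile

import pandas as pd

# Build a small example CSV to stand in for 'large_file.csv'.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "large_file.csv")
pd.DataFrame({"value": range(10_000)}).to_csv(path, index=False)

# Chunked reading: each iteration yields a DataFrame of up to
# `chunksize` rows, so only one chunk is in memory at a time.
total_rows = 0
for chunk in pd.read_csv(path, chunksize=1_000):
    total_rows += len(chunk)

print(total_rows)  # 10000
```

Note that every row is still read exactly once, so the time complexity stays O(n); chunking changes the *memory* footprint from O(n) to O(chunk size), not the total time.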