Why DataFrame creation matters in pandas - Performance Analysis
When we create a DataFrame in pandas, the time it takes can change depending on the data size.
We want to know how the time to build a DataFrame grows as we add more data.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

n = 1000  # example value for n
# Build 5 columns, each holding the integers 0..n-1
data = {f'col{i}': range(n) for i in range(5)}
df = pd.DataFrame(data)
```
This code creates a DataFrame with 5 columns and n rows, where each column is a range of numbers.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Creating each column with n elements and assembling them into a DataFrame.
- How many times: The operation repeats for each of the 5 columns, each with n elements.
As n grows, the time to create each column grows linearly, and assembling all columns grows with the total data size (rows times columns).
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 50 operations (5 columns x 10 rows) |
| 100 | About 500 operations |
| 1000 | About 5000 operations |
Pattern observation: The operations grow roughly in direct proportion to the number of rows times columns.
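The arithmetic behind the table above can be reproduced directly: the approximate operation count is just columns times rows.

```python
# Approximate operations from the table: columns * rows.
cols = 5
for n in (10, 100, 1000):
    print(f"n={n}: ~{cols * n} operations")
# → n=10: ~50 operations
# → n=100: ~500 operations
# → n=1000: ~5000 operations
```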
Time Complexity: O(n)
Because the number of columns is fixed at 5, the constant factor drops out: the time to create the DataFrame grows linearly with the number of rows. More generally, with c columns the work is O(n * c).
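One rough way to see this linear growth is to time construction at a few sizes. This is a sketch, not a benchmark: absolute times depend on your machine, and the helper name `build_frame` is illustrative.

```python
import time
import pandas as pd

def build_frame(n, cols=5):
    """Build an n-row DataFrame with `cols` integer columns."""
    data = {f'col{i}': range(n) for i in range(cols)}
    return pd.DataFrame(data)

# Each 10x increase in n should increase the build time by roughly 10x.
for n in (10_000, 100_000, 1_000_000):
    start = time.perf_counter()
    df = build_frame(n)
    elapsed = time.perf_counter() - start
    print(f"n={n:>9,}: {elapsed:.4f} s, shape={df.shape}")
```

In practice small sizes are dominated by fixed overhead, so the linear pattern is clearest once n is large.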
[X] Wrong: "Creating a DataFrame is always instant, no matter how big the data is."
[OK] Correct: The time depends on how many rows and columns you have; bigger data takes more time to build.
Understanding how DataFrame creation time grows helps you write efficient data loading and processing code.
"What if we increased the number of columns instead of rows? How would the time complexity change?"