Creating DataFrame from NumPy array in Pandas - Performance & Efficiency
We want to understand how the time needed to create a DataFrame from a NumPy array changes as the array gets bigger.
Specifically, how does the work grow when the input size increases?
Analyze the time complexity of the following code snippet.
```python
import numpy as np
import pandas as pd

# 1000 rows x 5 columns of random floats in [0, 1)
arr = np.random.rand(1000, 5)

# Wrap the array in a DataFrame with named columns col0..col4
df = pd.DataFrame(arr, columns=[f'col{i}' for i in range(5)])
```
This code creates a 1000-row, 5-column NumPy array and then converts it into a pandas DataFrame with column names.
Identify the repeated work: loops, recursion, or array traversals.
- Primary operation: Copying or referencing each element from the NumPy array into the DataFrame structure.
- How many times: Once for each element in the array, so total elements = rows x columns.
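The element count driving the work is easy to verify directly: NumPy reports it as `arr.size`, and it equals rows times columns. A small sketch (using a 100-row array rather than the 1000-row one above, purely for illustration):

```python
import numpy as np
import pandas as pd

# Smaller array for illustration: 100 rows x 5 columns
arr = np.random.rand(100, 5)

# Total number of elements the constructor must account for
total = arr.shape[0] * arr.shape[1]
print(total)     # 500
print(arr.size)  # 500 -- NumPy's own count, same value

df = pd.DataFrame(arr, columns=[f'col{i}' for i in range(5)])
print(df.shape)  # (100, 5): every element is represented in the DataFrame
```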
As the number of rows or columns grows, the time to create the DataFrame grows roughly in proportion to the total number of elements.
| Input Size (rows x columns) | Approx. Operations |
|---|---|
| 10 x 5 = 50 | About 50 operations |
| 100 x 5 = 500 | About 500 operations |
| 1000 x 5 = 5000 | About 5000 operations |
Pattern observation: The work grows linearly with the total number of elements in the array.
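You can see this trend empirically with a rough timing sketch. Note one assumption: we pass `copy=True` so pandas materializes a fresh copy of every element; with the default settings, pandas may reuse the NumPy buffer directly, which can make construction much faster than a full per-element pass. Absolute times will vary by machine; the point is the roughly linear trend as rows grow.

```python
import time
import numpy as np
import pandas as pd

# Time DataFrame construction for growing row counts.
# copy=True forces pandas to copy every element, so the work
# scales with the total element count (rows x columns).
for rows in (10_000, 100_000, 1_000_000):
    arr = np.random.rand(rows, 5)
    start = time.perf_counter()
    df = pd.DataFrame(arr, columns=[f'col{i}' for i in range(5)], copy=True)
    elapsed = time.perf_counter() - start
    print(f"{rows:>9} rows x 5 cols -> {elapsed:.6f} s")
```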
Time Complexity: O(n * m)
This means the time to create the DataFrame grows proportionally to the number of rows (n) times the number of columns (m).
[X] Wrong: "Creating a DataFrame from a NumPy array takes constant time regardless of size."
[OK] Correct: The DataFrame must process every element to build its structure, so the time grows with the total number of elements.
Understanding how data size affects processing time helps you write efficient data loading code and explain your choices clearly in interviews.
"What if we changed the input from a NumPy array to a list of lists? How would the time complexity change?"