str.split() for splitting in Pandas - Time & Space Complexity
We want to understand how the time needed to split strings in a pandas column changes as the number of rows grows.
How does the work increase when we have more data to split?
Analyze the time complexity of the following code snippet.
import pandas as pd
data = {'names': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'David Wilson'] * 1000}  # 4 names repeated 1000 times = 4000 rows
df = pd.DataFrame(data)
df['first_name'] = df['names'].str.split().str[0]  # split on whitespace, keep the first part
This code splits each full name in the 'names' column on whitespace (the default when str.split() is called without a separator) and keeps the first part as 'first_name'.
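To make the two chained steps easier to see, here is a minimal sketch on a four-row Series. The variable names `names`, `parts`, and `first_names` are illustrative and not part of the original snippet.

```python
import pandas as pd

names = pd.Series(['Alice Smith', 'Bob Jones', 'Charlie Brown', 'David Wilson'])

# Step 1: split each string on whitespace; every element becomes a list of parts.
parts = names.str.split()

# Step 2: take element 0 of every list through the .str accessor.
first_names = parts.str[0]

print(parts.tolist())
# [['Alice', 'Smith'], ['Bob', 'Jones'], ['Charlie', 'Brown'], ['David', 'Wilson']]
print(first_names.tolist())
# ['Alice', 'Bob', 'Charlie', 'David']
```

Both steps touch every row once, which is why the row count drives the total work.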
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Splitting each string in the column by spaces.
- How many times: Once for every row in the DataFrame.
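Conceptually, the repeated work described above behaves like an explicit loop over the rows. The sketch below is a plain-Python stand-in, not pandas' actual internals; `first_names_loop` is a hypothetical helper written only to count the splits and compare the result with the vectorized call.

```python
import pandas as pd

names = pd.Series(['Alice Smith', 'Bob Jones', 'Charlie Brown', 'David Wilson'] * 3)

def first_names_loop(values):
    """Hypothetical helper: loop-based equivalent of .str.split().str[0]."""
    result = []
    split_count = 0
    for full_name in values:        # one iteration per row
        parts = full_name.split()   # one split per row
        split_count += 1
        result.append(parts[0])
    return result, split_count

looped, split_count = first_names_loop(names)
vectorized = names.str.split().str[0].tolist()

print(split_count)           # 12, one split for each of the 12 rows
print(looped == vectorized)  # True
```

The counter always equals the number of rows, which matches the "once for every row" observation.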
As the number of rows increases, the total splitting work grows proportionally.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 splits |
| 100 | 100 splits |
| 1000 | 1000 splits |
Pattern observation: Doubling the rows doubles the number of splits needed.
Time Complexity: O(n)
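This means the time to split strings grows linearly with the number of rows, assuming the strings stay roughly the same length. Each split also has to scan the characters of its string, so much longer strings would raise the cost per row.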
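One way to check the linear trend empirically is to time the same operation at a few sizes. This is a rough sketch: `time_split`, `base`, and the chosen sizes are illustrative assumptions, and the measured times will vary by machine and pandas version, so no specific numbers are claimed here.

```python
import time
import pandas as pd

base = ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'David Wilson']

def time_split(n_rows):
    """Hypothetical helper: seconds taken to split a column of n_rows names."""
    names = pd.Series(base * (n_rows // len(base)))
    start = time.perf_counter()
    names.str.split().str[0]
    return time.perf_counter() - start

# Each size is double the previous one.
sizes = [10_000, 20_000, 40_000]
timings = {n: time_split(n) for n in sizes}

for n, seconds in timings.items():
    print(f"{n:>6} rows: {seconds:.4f} s")
```

If the operation is O(n), each doubling of the row count should roughly double the measured time, although small inputs can be noisy.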
[X] Wrong: "Splitting strings in a column happens instantly no matter how many rows there are."
[OK] Correct: Each row requires its own split operation, so more rows mean more work and more time.
Understanding how string operations scale helps you write efficient data processing code and explain your choices clearly.
"What if we split only the first 5 rows instead of the whole column? How would the time complexity change?"
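A possible starting point for exploring this question is sketched below, using `head(5)` to select the rows before splitting. The variable name `first_five` is illustrative.

```python
import pandas as pd

data = {'names': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'David Wilson'] * 1000}
df = pd.DataFrame(data)

# Select the first 5 rows before splitting, so only 5 splits run.
first_five = df['names'].head(5).str.split().str[0]

print(len(df))              # 4000
print(first_five.tolist())  # ['Alice', 'Bob', 'Charlie', 'David', 'Alice']
```

Because the number of splits is fixed at 5 regardless of how many rows the DataFrame holds, the splitting work no longer grows with n. Under the assumption that selecting the first rows is itself cheap, that work is effectively constant, O(1).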