When to use apply vs vectorized operations in Pandas - Performance Comparison
We want to understand how the time to run pandas code changes when using apply versus vectorized operations.
Which method grows faster as data size increases?
Analyze the time complexity of these two approaches.
import pandas as pd
n = 10 # Define n before using it
df = pd.DataFrame({'A': range(n)})
# Using apply
result_apply = df['A'].apply(lambda x: x * 2)
# Using vectorized operation
result_vec = df['A'] * 2
This code multiplies each value in column 'A' by 2 using two methods: apply with a lambda, and a vectorized operation.
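As a quick sanity check, both methods produce an identical Series, so the choice between them is purely about performance:

```python
import pandas as pd

n = 10
df = pd.DataFrame({'A': range(n)})

result_apply = df['A'].apply(lambda x: x * 2)  # one Python call per row
result_vec = df['A'] * 2                        # single vectorized multiply

# Both methods produce the same values
print(result_apply.equals(result_vec))  # True
```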
Look at what repeats as the data grows.
- Primary operation: For apply, the lambda function runs once per row.
- How many times: Exactly n times, where n is the number of rows.
- Vectorized operation: Uses optimized internal code that processes all rows together.
- How many times: Effectively once at the Python level, working on the whole column at once.
As the number of rows grows, the time for apply grows linearly because it runs the function on each row.
Vectorized operations grow much slower because they use fast, low-level code that handles many rows at once.
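A minimal timing sketch makes the growth pattern visible. Absolute numbers depend on your machine, but the gap between the two methods should widen as n grows:

```python
import time
import pandas as pd

# Illustrative benchmark; exact timings vary by machine
for n in [10_000, 100_000, 1_000_000]:
    df = pd.DataFrame({'A': range(n)})

    start = time.perf_counter()
    df['A'].apply(lambda x: x * 2)  # one Python call per row
    apply_time = time.perf_counter() - start

    start = time.perf_counter()
    df['A'] * 2                     # one call into compiled code
    vec_time = time.perf_counter() - start

    print(f"n={n:>9,}: apply={apply_time:.4f}s  vectorized={vec_time:.4f}s")
```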
| Input Size (n) | Python-level operations for apply | Python-level operations for vectorized |
|---|---|---|
| 10 | 10 function calls | 1 fast operation |
| 100 | 100 function calls | 1 fast operation |
| 1000 | 1000 function calls | 1 fast operation |
Pattern observation: the number of Python-level function calls for apply grows directly with data size, while the vectorized version always issues a single call into compiled code.
Time Complexity: O(n) Python-level function calls for apply; O(1) for vectorized operations. Strictly speaking, the vectorized version still touches all n elements, but it does so inside fast compiled loops, so its constant factor is far smaller.
This means apply takes noticeably longer as data grows, while vectorized operations handle large data efficiently.
[X] Wrong: "Using apply is just as fast as vectorized operations because both process all rows."
[OK] Correct: apply runs Python code for each row, which is slower, while vectorized operations use optimized compiled code that works on all data at once.
Knowing when to use vectorized operations versus apply shows you understand efficient data handling, a key skill in data science work.
What if the function inside apply was a built-in NumPy function instead of a lambda? How would that affect the time complexity?
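One way to explore this question is to compare the two call styles directly. Note that recent pandas versions detect NumPy ufuncs passed to apply and may dispatch them to the whole array at once, so the behavior can differ from a plain Python lambda and may vary by pandas version:

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({'A': range(n)})

via_apply = df['A'].apply(np.sqrt)  # NumPy ufunc passed to apply
direct = np.sqrt(df['A'])           # ufunc called on the whole column

# Same values either way; try timing both to see how your
# pandas version handles a ufunc inside apply
print(via_apply.equals(direct))  # True
```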