Data Type Optimization in Python Data Analysis - Time & Space Complexity
Changing data types during data analysis can affect how fast our code runs and how much memory it uses. Here we analyze how such a change affects the time it takes to process the data.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

data = pd.DataFrame({
    'numbers': range(1000000),
    'floats': [float(x) for x in range(1000000)]
})
data['numbers'] = data['numbers'].astype('int32')  # change to a smaller integer type
result = data['numbers'].sum()
```
This code changes a column's data type to a smaller integer type and then sums the values.
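As a quick sanity check, a toy version of the same pattern (1,000 rows instead of 1,000,000) shows that casting to `int32` changes the storage type but not the result of the sum:

```python
import pandas as pd

# Toy version of the snippet above
data = pd.DataFrame({'numbers': range(1000)})
before = data['numbers'].sum()  # sum with the default int64 dtype

data['numbers'] = data['numbers'].astype('int32')
after = data['numbers'].sum()   # same values, smaller dtype

print(data['numbers'].dtype)   # int32
print(before == after)         # True: the cast does not change the sum
```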
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Summing all values in the column.
- How many times: Once for each of the 1,000,000 rows.
As the number of rows grows, the sum operation takes longer because it looks at each value once.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 sums |
| 100 | 100 sums |
| 1000 | 1000 sums |
Pattern observation: The time grows directly with the number of rows.
Time Complexity: O(n)
This means the time to sum grows in a straight line as the data size grows.
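One way to see the linear growth concretely is to count additions in a plain Python loop that mimics what the sum must do. This is only a sketch: pandas actually sums in optimized C code, but it still has to visit every value once.

```python
def count_sum_operations(values):
    """Sum a sequence while counting how many additions are performed."""
    total = 0
    operations = 0
    for v in values:        # one pass over the data: O(n)
        total += v
        operations += 1
    return total, operations

for n in (10, 100, 1000):
    total, ops = count_sum_operations(range(n))
    print(n, ops)           # the operation count grows in lockstep with n
```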
[X] Wrong: "Changing data types always makes the code run faster."
[OK] Correct: Changing data types can save memory but does not change how many times the code must add values.
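The memory point can be checked directly with pandas' `memory_usage` on toy data: casting from `int64` to `int32` halves the column's memory, while the number of values the sum must visit stays the same.

```python
import pandas as pd

data = pd.DataFrame({'numbers': range(1000)})
mem_int64 = data['numbers'].memory_usage(index=False)  # 1000 values * 8 bytes
data['numbers'] = data['numbers'].astype('int32')
mem_int32 = data['numbers'].memory_usage(index=False)  # 1000 values * 4 bytes

print(mem_int64, mem_int32)  # memory halves...
print(len(data))             # ...but the row count (the work for sum) is unchanged
```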
Understanding how data size and types affect speed shows you can write efficient data analysis code.
"What if we changed the sum operation to a nested loop over the data? How would the time complexity change?"
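As a sketch of that follow-up question, a hypothetical nested loop that touches every pair of values performs n * n operations, so the time complexity jumps from O(n) to O(n^2):

```python
def nested_operation_count(values):
    """Count operations for a nested loop over the same data: O(n^2)."""
    operations = 0
    for a in values:          # outer loop: n iterations
        for b in values:      # inner loop: n iterations for each outer pass
            operations += 1   # one operation per (a, b) pair
    return operations

for n in (10, 100):
    print(n, nested_operation_count(range(n)))  # n=10 -> 100 ops, n=100 -> 10000 ops
```

With 1,000,000 rows this would be 10^12 operations instead of 10^6, which is why avoiding unnecessary nested passes matters far more than the dtype.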