`sample()` for Random Rows in Python Data Analysis - Time & Space Complexity
We want to understand how the time to pick random rows grows as the dataset gets bigger: how does the sampling time change when the dataset size increases?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

n = 1000  # example dataset size
k = 3     # example sample size

data = pd.DataFrame({'A': range(n)})  # a table with n rows
sample_rows = data.sample(k)          # pick k random rows
```
This code creates a table with n rows and picks k random rows from it.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Selecting k random rows from n rows.
- How many times it repeats: proportional to k, the number of rows sampled.
Picking k rows out of n takes time that depends mostly on k, not on n.
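The same idea can be sketched in plain Python as an analogy (this is not pandas' internal implementation): drawing k items from a range of n with `random.sample` makes only k selections, so the work tracks k rather than n.

```python
import random

# For several dataset sizes n, draw the same k = 3 items.
# The number of selections made is always k, regardless of n.
for n in (10, 100, 1000):
    picked = random.sample(range(n), 3)
    print(f"n={n}: picked {len(picked)} items")  # always 3
```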
| Input Size (n) | Approx. Operations (k = 3) |
|---|---|
| 10 | About k operations (e.g., 3) |
| 100 | About k operations (e.g., 3) |
| 1000 | About k operations (e.g., 3) |
Pattern observation: The time grows with k, the sample size, not with n, the total data size.
Time Complexity: O(k)
This means the time to pick random rows grows with how many rows you want, not how big the whole data is.
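A quick way to see the pattern is to sample the same k from tables of growing size; a minimal sketch using the pandas API from above (`random_state` is added here only for reproducibility):

```python
import pandas as pd

k = 3
for n in (10, 100, 1000):
    df = pd.DataFrame({'A': range(n)})
    rows = df.sample(k, random_state=0)
    # The result always has k rows, no matter how large n is.
    print(f"n={n}: sampled {len(rows)} rows")
```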
[X] Wrong: "Sampling random rows takes longer as the whole dataset gets bigger."
[OK] Correct: The sampling method usually picks only the needed rows, so time depends on sample size, not total data size.
Knowing how sampling scales helps you explain efficient data handling in real projects, showing you understand practical data work.
"What if we change k to be a fraction of n (like 10% of n)? How would the time complexity change then?"