Mann-Whitney U test in SciPy - Time & Space Complexity
We want to understand how the time needed to run the Mann-Whitney U test changes as the size of the input data grows.
Specifically, how does the test's execution time increase when we have more data points?
Analyze the time complexity of the following code snippet.
import numpy as np
from scipy.stats import mannwhitneyu
data1 = np.random.rand(1000)
data2 = np.random.rand(1000)
stat, p = mannwhitneyu(data1, data2)
print(f"U statistic: {stat}, p-value: {p}")
This code runs the Mann-Whitney U test on two arrays of random numbers, each with 1000 elements, to compare their distributions.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Comparing and ranking all combined elements from both input arrays.
- How many times: The algorithm processes all elements together, so it handles about n + m elements, where n and m are the sizes of the two arrays.
As the number of data points increases, the time to rank and compare grows faster than just adding more elements.
| Input Size (n = m) | Approx. Operations |
|---|---|
| 10 | About 20 elements ranked and compared |
| 100 | About 200 elements ranked and compared |
| 1000 | About 2000 elements ranked and compared |
Pattern observation: The work grows roughly proportional to the total number of elements times the log of that number, because sorting is involved.
Time Complexity: O((n + m) log(n + m))
This means the time to run the test grows a bit faster than just the total number of data points, mainly because it sorts all the data combined.
[X] Wrong: "The Mann-Whitney U test runs in linear time because it just compares elements once."
[OK] Correct: The test needs to sort or rank all combined data points, which takes more time than just one pass through the data.
Understanding how sorting affects the time needed for statistical tests like Mann-Whitney U helps you explain your approach clearly and shows you know how data size impacts performance.
"What if one input array is much smaller than the other? How would that affect the time complexity?"