0
0
SciPydata~5 mins

K-means via scipy vs scikit-learn - Performance Comparison

Choose your learning style9 modes available
Time Complexity: K-means via scipy vs scikit-learn
O(n x k x i)
Understanding Time Complexity

We want to understand how the time it takes to run K-means clustering grows as we use more data points or clusters.

This helps us know how fast or slow the algorithm will be when using scipy compared to scikit-learn.

Scenario Under Consideration

Analyze the time complexity of this K-means clustering code using scipy.


from scipy.cluster.vq import kmeans
import numpy as np

# Generate sample data
data = np.random.rand(1000, 2)

# Run K-means clustering
centroids, distortion = kmeans(data, 5, iter=20)
    

This code runs K-means on 1000 points with 5 clusters and up to 20 iterations.

Identify Repeating Operations

Look at what repeats in the algorithm:

  • Primary operation: Assigning each data point to the nearest cluster center.
  • How many times: This happens every iteration, up to 20 times here.
  • Also, updating cluster centers after assignments repeats each iteration.
How Execution Grows With Input

As we increase data points or clusters, the work grows like this:

Input Size (n points)Approx. Operations
1010 points x 5 clusters x 20 iterations = 1,000
100100 x 5 x 20 = 10,000
10001000 x 5 x 20 = 100,000

Pattern observation: The operations grow roughly in a straight line with the number of points and clusters multiplied by iterations.

Final Time Complexity

Time Complexity: O(n x k x i)

This means the time grows proportionally with the number of points (n), clusters (k), and iterations (i).

Common Mistake

[X] Wrong: "The time depends only on the number of data points."

[OK] Correct: The number of clusters and iterations also multiply the work, so ignoring them misses important parts of the cost.

Interview Connect

Understanding how K-means scales helps you explain algorithm choices clearly and shows you can think about performance in real projects.

Self-Check

"What if we reduce the number of iterations by half? How would the time complexity change?"