Which of the following best explains why clustering algorithms group similar data points together?
Think about how distance or similarity between points affects grouping.
Clustering groups data points that are close to each other in the feature space because they share similar characteristics. This proximity is measured using distance metrics like Euclidean distance.
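To make the idea of proximity concrete, here is a minimal sketch of Euclidean distance between a nearby pair and a distant pair of points. The helper name `euclidean` and the point values are illustrative assumptions, not part of any library.

```python
import numpy as np

# Illustrative points: a and b are close together, c is far from both.
a = np.array([1.0, 2.0])
b = np.array([1.0, 4.0])
c = np.array([10.0, 2.0])


def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


d_ab = euclidean(a, b)  # distance between the nearby pair
d_ac = euclidean(a, c)  # distance between the distant pair
print(d_ab, d_ac)
```

A distance-based clustering method would tend to put `a` and `b` in the same group before it ever considers joining either of them with `c`.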
What is the output labels array after running this clustering code?
    from scipy.cluster.hierarchy import fcluster, linkage
    import numpy as np

    # Sample data points
    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    # Perform hierarchical clustering
    Z = linkage(X, method='single')

    # Form flat clusters with max distance 3
    labels = fcluster(Z, t=3, criterion='distance')
    print(labels)
Look at how points close in space are grouped with a distance threshold of 3.
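The first three points (x = 1) each lie within distance 2 of a neighbor, so they form cluster 1. The last three points (x = 10) are equally close to each other and form cluster 2. The two groups are 9 units apart, well above the threshold of 3, so they never merge, and the labels array reflects these two groups.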
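One way to inspect the result is to collect the points under each label. The sketch below reuses the same data and settings; the `groups` dictionary is only an illustrative helper.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
Z = linkage(X, method='single')
labels = fcluster(Z, t=3, criterion='distance')

# Map each cluster label to the list of points assigned to it.
groups = defaultdict(list)
for point, label in zip(X.tolist(), labels.tolist()):
    groups[label].append(point)

for label in sorted(groups):
    print(label, groups[label])
```

Each printed line shows one cluster label followed by its member points, which makes it easy to confirm that the x = 1 points and the x = 10 points end up in separate groups.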
Given the same data and linkage matrix, how many clusters are formed when the distance threshold changes?
    from scipy.cluster.hierarchy import fcluster, linkage
    import numpy as np

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
    Z = linkage(X, method='single')

    clusters_t2 = fcluster(Z, t=2, criterion='distance')
    clusters_t5 = fcluster(Z, t=5, criterion='distance')

    num_clusters_t2 = len(set(clusters_t2))
    num_clusters_t5 = len(set(clusters_t5))
    print(num_clusters_t2, num_clusters_t5)
Smaller distance thresholds create more clusters; larger thresholds merge clusters.
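At threshold 2, two clusters form: with criterion='distance' the threshold is inclusive (cophenetic distance no greater than t), and every merge inside each group happens at exactly distance 2. At threshold 5, there are still two clusters, because the two groups only merge at distance 9. The code prints 2 2. A threshold below 2 would leave all six points as separate clusters.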
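The effect of the threshold can be checked directly by counting clusters over a few values of t. This is a small sketch on the same data; the threshold list is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
Z = linkage(X, method='single')

# Count the distinct flat-cluster labels at each distance threshold.
thresholds = [1.9, 2, 5, 9]
counts = {}
for t in thresholds:
    flat = fcluster(Z, t=t, criterion='distance')
    counts[t] = len(np.unique(flat))

for t in thresholds:
    print(t, counts[t])
```

The count only changes when t crosses one of the merge distances stored in the linkage matrix, which for this data are 2 and 9.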
Which statement correctly describes the dendrogram shown below for hierarchical clustering?
(Imagine a dendrogram with two main branches splitting at a height around 3)
Look at where the branches join and the height to decide cluster groups.
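A dendrogram visually represents how clusters merge at different distances: the height of each join is the distance at which two clusters merge. Cutting just below the height where the two main branches join (around 3) splits the data into two main clusters.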
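The heights in a dendrogram come straight from the third column of the linkage matrix. The sketch below reads them without plotting, using the sample data from the other questions as an assumption. For that data the top merge sits at 9 rather than the height of about 3 in the imagined figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
Z = linkage(X, method='single')

# Column 2 of the linkage matrix holds the distance of each merge,
# which is the height of the corresponding join in the dendrogram.
merge_heights = Z[:, 2]
print(merge_heights)

# no_plot=True computes the tree layout without drawing anything.
tree = dendrogram(Z, no_plot=True)
print(tree['ivl'])  # leaf labels in left-to-right order
```

Cutting the tree at any height between the last two distinct merge heights separates the data into the two main branches.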
What error will this code raise when run?
    from scipy.cluster.hierarchy import linkage, fcluster
    import numpy as np

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
    Z = linkage(X, method='single')

    # Incorrect use of fcluster with invalid criterion
    labels = fcluster(Z, t=3, criterion='invalid')
    print(labels)
Check the valid options for the 'criterion' parameter in fcluster.
The 'criterion' parameter must be one of 'inconsistent', 'distance', 'maxclust', 'monocrit', or 'maxclust_monocrit'. Using 'invalid' raises a ValueError.
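A minimal sketch of catching this error is shown below. The variable `error_message` is an illustrative name, and the exact wording of the message may differ between SciPy versions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
Z = linkage(X, method='single')

error_message = None
try:
    # 'invalid' is not a recognized criterion, so this call fails.
    labels = fcluster(Z, t=3, criterion='invalid')
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
```

Catching the exception this way is mainly useful when the criterion comes from user input or configuration rather than a hard-coded string.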