K-Means & K-Medoids Clustering

K-Means is one of the most popular clustering methods and is worth knowing for any learner. In this note, we go through the idea behind K-Means and how to use it with Scikit-learn. We also look at its variants (K-medoids, K-modes, K-medians).

👉 Metrics for clustering

K-Means

Idea?

1. Randomly choose $k$ centroids.
2. Go through each example and assign it to the nearest centroid (it takes the class of that centroid).
3. Move each centroid to the average of the data points assigned to its class.
4. Repeat steps 2 and 3 until convergence (the assignments no longer change).
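The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not Scikit-learn's implementation: the function name `kmeans_sketch`, the `n_iters` cap, the choice of initial centroids among the data points, and the toy array `X_demo` are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Hypothetical helper: plain K-Means, returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary (which group gets 0 or 1 depends on the random start); only the grouping is meaningful.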

The basic steps of K-Means.

A GIF illustrating the idea of the K-Means algorithm. Source.

How to choose k?

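Use the "Elbow" method to choose the number of clusters $k$: plot the within-cluster sum of squared distances against $k$ and pick the value where the curve bends, i.e. where increasing $k$ further brings only small improvements.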
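A rough sketch of how the elbow curve can be computed with Scikit-learn. The synthetic data from `make_blobs`, the range of $k$ values, and `n_init=10` are assumptions made only for this demo; `inertia_` is the fitted model's within-cluster sum of squared distances.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 blobs (an assumption made only for this demo).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

k_values = range(1, 9)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` (for example with Matplotlib) gives the elbow curve; on data like this, the bend would be expected near the true number of blobs.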

Discussion

A type of partitioning clustering.
K-Means is sensitive to outliers because each centroid is a mean. K-medoids clustering, or PAM (Partitioning Around Medoids), uses actual data points as cluster centers and is less sensitive to outliers (ref).
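A tiny one-dimensional illustration of why a medoid is more robust than a mean. The numbers are made up for this example, and the medoid is taken here as the data point with the smallest total distance to all other points.

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 plays the outlier

# Mean-based center (what K-Means uses): pulled toward the outlier.
mean_center = points.mean()

# Medoid: the data point with the smallest total distance to all others.
pairwise = np.abs(points[:, None] - points[None, :])
medoid_center = points[pairwise.sum(axis=1).argmin()]

print(mean_center)    # 22.0
print(medoid_center)  # 3.0
```

The mean lands far from every regular point, while the medoid stays inside the bulk of the data.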

K-Means in code

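from sklearn.cluster import KMeans

# X: array-like of shape (n_samples, n_features)
kmeans = KMeans(n_clusters=10, random_state=0)  # default n_clusters=8

kmeans.fit(X)      # learn the centroids
kmeans.predict(X)  # index of the nearest centroid for each sample

# or, in one step
kmeans.fit_predict(X)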
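For a version that runs end to end, here is a small sketch on a made-up array (`X_demo` and the two query points are assumptions for this example). It also shows the fitted attributes `labels_`, `cluster_centers_` and `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two groups, one around x=1 and one around x=10.
X_demo = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

print(kmeans.labels_)           # cluster index of each training sample
print(kmeans.cluster_centers_)  # array of shape (n_clusters, n_features)
print(kmeans.inertia_)          # sum of squared distances to the closest centroid

# Assign new points to the learned clusters.
new_labels = kmeans.predict(np.array([[0.0, 0.0], [12.0, 3.0]]))
print(new_labels)
```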

Some notable parameters (see full):