The Math Behind K-Means Clustering (2024)


Introduction

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=5 there will be five clusters, if K=10 there will be ten clusters, and so on. The k-means clustering algorithm mainly performs two tasks:

  • Determines the best positions for the K center points (centroids).
  • Assigns each data point to its closest center, by measuring the distance between the data point and each of the K centers.

In this blog, we are going to learn the math behind K-Means Clustering. If you want to learn how to implement K-Means Clustering, please check my other blog here, which includes the source code.

K-Means Clustering Algorithm

K-Means Clustering Algorithm involves the following steps:

Step 1: Choose the number of clusters, K.

Step 2: Randomly select K data points as the initial cluster centers.

Step 3: Measure the distance between each data point and each cluster center using the Euclidean distance formula: d = sqrt[(x2–x1)² + (y2–y1)²].


Step 4: Assign each data point to that cluster whose center is nearest to that data point.

Step 5: Re-compute the center of newly formed clusters. The center of a cluster is computed by taking the mean of all the data points contained in that cluster.

Step 6: Keep repeating Steps 3 to 5 until one of the following stopping criteria is met:

  • Data points stay in the same clusters between iterations
  • The maximum number of iterations is reached
  • The newly computed cluster centers do not change
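The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the loop, not a production implementation:

```python
import math

def kmeans(points, centers, max_iter=100):
    """Minimal K-Means loop following Steps 3-6 above."""
    clusters = []
    for _ in range(max_iter):
        # Steps 3-4: assign each point to its nearest center (Euclidean distance)
        clusters = [[] for _ in centers]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step 5: re-compute each center as the mean of its cluster's points
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                xs, ys = zip(*cluster)
                new_centers.append((sum(xs) / len(cluster), sum(ys) / len(cluster)))
            else:
                new_centers.append(old)  # keep an empty cluster's old center
        # Step 6: stop when the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

With the example data worked through below (K=3, initial centers P7, P9, P8), this loop converges after two iterations.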

Example

Let's consider the data points P1(1,3), P2(2,2), P3(5,8), P4(8,5), P5(3,9), P6(10,7), P7(3,3), P8(9,4), P9(3,7).

First, we take our K value as 3 and assume that our initial cluster centers are P7(3,3), P9(3,7), and P8(9,4), labeled C1, C2, and C3. We will find the new centroids after 2 iterations for the above data points.

Step 1

Find the distance between each data point and the centroids. Each data point is then moved to the cluster of the centroid at minimum distance from it.


Iteration 1

Calculate the distance between the data points and the K centers (C1, C2, C3).

For P1,

C1P1 =>(3,3)(1,3) => sqrt[(1–3)²+(3–3)²] => sqrt[4] =>2

C2P1 =>(3,7)(1,3)=> sqrt[(1–3)²+(3–7)²] => sqrt[20] =>4.5

C3P1 =>(9,4)(1,3) => sqrt[(1–9)²+(3–4)²]=> sqrt[65] =>8.1

For P2,

C1P2 =>(3,3)(2,2) => sqrt[(2–3)²+(2–3)²] => sqrt[2] =>1.4

C2P2 =>(3,7)(2,2)=> sqrt[(2–3)²+(2–7)²] => sqrt[26] =>5.1

C3P2 =>(9,4)(2,2) => sqrt[(2–9)²+(2–4)²]=> sqrt[53] =>7.3

For P3,

C1P3 =>(3,3)(5,8) => sqrt[(5–3)²+(8–3)²] => sqrt[29] =>5.4

C2P3 =>(3,7)(5,8)=> sqrt[(5–3)²+(8–7)²] => sqrt[5] =>2.2

C3P3 =>(9,4)(5,8) => sqrt[(5–9)²+(8–4)²]=> sqrt[32] =>5.7

Similarly for other distances..
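The remaining point-to-center distances follow the same pattern. A short script (using Python's `math.dist` for the Euclidean distance) reproduces the whole distance table and the nearest center for each point:

```python
import math

points = {"P1": (1, 3), "P2": (2, 2), "P3": (5, 8), "P4": (8, 5), "P5": (3, 9),
          "P6": (10, 7), "P7": (3, 3), "P8": (9, 4), "P9": (3, 7)}
centers = {"C1": (3, 3), "C2": (3, 7), "C3": (9, 4)}

for name, p in points.items():
    # distance from this point to each center, rounded to one decimal
    dists = {c: round(math.dist(p, ctr), 1) for c, ctr in centers.items()}
    nearest = min(dists, key=dists.get)
    print(f"{name}: {dists} -> {nearest}")
```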


Cluster 1 => P1(1,3) , P2(2,2) , P7(3,3)

Cluster 2 => P3(5,8) , P5(3,9) , P9(3,7)

Cluster 3 => P4(8,5) , P6(10,7) , P8(9,4)

Now, we re-compute the cluster centers: each new center is the mean of all the points contained in that particular cluster.

New center of Cluster 1 => (1+2+3)/3 , (3+2+3)/3 => (2, 2.7)

New center of Cluster 2 => (5+3+3)/3 , (8+9+7)/3 => (3.7, 8)

New center of Cluster 3 => (8+10+9)/3 , (5+7+4)/3 => (9, 5.3)
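The centroid arithmetic above can be checked with a few lines of Python (the cluster contents are the groups listed above):

```python
def centroid(cluster):
    """Mean of the x-coordinates and mean of the y-coordinates."""
    xs, ys = zip(*cluster)
    return (round(sum(xs) / len(cluster), 1), round(sum(ys) / len(cluster), 1))

cluster1 = [(1, 3), (2, 2), (3, 3)]   # P1, P2, P7
cluster2 = [(5, 8), (3, 9), (3, 7)]   # P3, P5, P9
cluster3 = [(8, 5), (10, 7), (9, 4)]  # P4, P6, P8

print(centroid(cluster1))  # (2.0, 2.7)
print(centroid(cluster2))  # (3.7, 8.0)
print(centroid(cluster3))  # (9.0, 5.3)
```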

Iteration 1 is over. Now let us take the new center points and repeat the same steps: calculate the distance between the data points and the new centers with the Euclidean formula, and form the cluster groups again.

Iteration 2

Calculate the distance between the data points and the new K centers (C1, C2, C3).

C1(2,2.7) , C2(3.7,8) , C3(9,5.3)

C1P1 =>(2,2.7)(1,3) => sqrt[(1–2)²+(3–2.7)²] => sqrt[1.09] =>1.0

C2P1 =>(3.7,8)(1,3)=> sqrt[(1–3.7)²+(3–8)²] => sqrt[32.29] =>5.7

C3P1 =>(9,5.3)(1,3) => sqrt[(1–9)²+(3–5.3)²]=> sqrt[69.29] =>8.3

Similarly for other distances..
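The rest of iteration 2 can be verified the same way: re-assigning every point to its nearest new center, C1(2, 2.7), C2(3.7, 8), C3(9, 5.3), reproduces exactly the same three groups. A quick sketch:

```python
import math

points = {"P1": (1, 3), "P2": (2, 2), "P3": (5, 8), "P4": (8, 5), "P5": (3, 9),
          "P6": (10, 7), "P7": (3, 3), "P8": (9, 4), "P9": (3, 7)}
new_centers = {"C1": (2, 2.7), "C2": (3.7, 8), "C3": (9, 5.3)}

# nearest new center for every point
assignment = {name: min(new_centers, key=lambda c: math.dist(p, new_centers[c]))
              for name, p in points.items()}
print(assignment)
```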


Cluster 1 => P1(1,3) , P2(2,2) , P7(3,3)

Cluster 2 => P3(5,8) , P5(3,9) , P9(3,7)

Cluster 3 => P4(8,5) , P6(10,7) , P8(9,4)

Center of Cluster 1 => (1+2+3)/3 , (3+2+3)/3 => (2, 2.7)

Center of Cluster 2 => (5+3+3)/3 , (8+9+7)/3 => (3.7, 8)

Center of Cluster 3 => (8+10+9)/3 , (5+7+4)/3 => (9, 5.3)

We got the same centroids and cluster groups as in iteration 1, which means the algorithm has converged. Since the clusters are repeating, there is no need to continue iterating, and the last iteration gives the final cluster groups for this dataset.
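As a final sanity check on this clustering, we can compute its WCSS (within-cluster sum of squares): the sum of squared distances from every point to its own cluster center. This is also the quantity the Elbow method plots against K. Using the centers rounded to one decimal as above:

```python
import math

final_clusters = {
    (2.0, 2.7): [(1, 3), (2, 2), (3, 3)],   # Cluster 1
    (3.7, 8.0): [(5, 8), (3, 9), (3, 7)],   # Cluster 2
    (9.0, 5.3): [(8, 5), (10, 7), (9, 4)],  # Cluster 3
}
wcss = sum(math.dist(p, center) ** 2
           for center, pts in final_clusters.items() for p in pts)
print(round(wcss, 2))  # 14.01
```

For larger datasets, one would compute the WCSS for several values of K and pick the "elbow" where the curve stops dropping sharply.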

The graph below shows the difference between iterations 1 and 2: we can see that the centroids (green dots) moved in the second iteration.

[Figure: cluster assignments and centroids after iterations 1 and 2]

Conclusion

I hope the math steps behind K-Means Clustering are clear now. In this blog we used a small dataset, so we set the K value to 3 and needed only 2 iterations. Real-world datasets have many more features and points; in that case we should use the Elbow method (based on WCSS) to find a good K value (number of cluster groups). Check out some of my previous blogs:

  • Implementation of K-Means Clustering (medium.com)
  • Linear Regression - LSM (medium.com)
  • Implementation of Linear Regression using Python (medium.com)

Have doubts? Need help? Contact me!

LinkedIn: https://www.linkedin.com/in/dharmaraj-d-1b707898

Github: https://github.com/DharmarajPi
