The Math Behind K-Means Clustering (2024)


Introduction

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=5 there will be five clusters, if K=10 there will be ten clusters, and so on. The k-means clustering algorithm mainly performs two tasks:

  • Determines the best positions for the K center points (centroids).
  • Assigns each data point to its closest center, by measuring the distance between the data point and each of the K centers.

In this blog, we are going to learn the math behind K-Means Clustering. If you want to learn how to implement K-Means Clustering, please check my other blog here, which includes the source code.

K-Means Clustering Algorithm

K-Means Clustering Algorithm involves the following steps:

Step 1: Choose the number of clusters, K.

Step 2: Randomly select K data points as the initial cluster centers.

Step 3: Measure the distance between each data point and each cluster center using the Euclidean distance formula: d = sqrt[(x2–x1)² + (y2–y1)²].


Step 4: Assign each data point to that cluster whose center is nearest to that data point.

Step 5: Re-compute the center of newly formed clusters. The center of a cluster is computed by taking the mean of all the data points contained in that cluster.

Step 6: Keep repeating Steps 3 to 5 until one of the following stopping criteria is met:

  • Data points stay in the same clusters between iterations
  • The maximum number of iterations is reached
  • The newly computed cluster centers do not change
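The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the loop, not a production implementation:

```python
import math

def kmeans(points, centers, max_iter=100):
    """Minimal K-Means loop following Steps 3-6 above."""
    clusters = []
    for _ in range(max_iter):
        # Steps 3-4: assign each point to its nearest center (Euclidean distance)
        clusters = [[] for _ in centers]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step 5: re-compute each center as the mean of its cluster's points
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                xs, ys = zip(*cluster)
                new_centers.append((sum(xs) / len(cluster), sum(ys) / len(cluster)))
            else:
                new_centers.append(old)  # keep an empty cluster's old center
        # Step 6: stop when the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

With the example data worked through below (K=3, initial centers P7, P9, P8), this loop converges after two iterations.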

Example

Let's consider the data points P1(1,3), P2(2,2), P3(5,8), P4(8,5), P5(3,9), P6(10,7), P7(3,3), P8(9,4), P9(3,7).

First, we take our K value as 3 and assume that our initial cluster centers are P7(3,3), P9(3,7), and P8(9,4), labeled C1, C2, and C3. We will find the new centroids after 2 iterations for the above data points.

Step 1

Find the distance between each data point and the centroids. Each data point is then moved to the cluster of the centroid at minimum distance from it.


Iteration 1

Calculate the distance between the data points and the K centers (C1, C2, C3).

For P1,

C1P1 =>(3,3)(1,3) => sqrt[(1–3)²+(3–3)²] => sqrt[4] =>2

C2P1 =>(3,7)(1,3)=> sqrt[(1–3)²+(3–7)²] => sqrt[20] =>4.5

C3P1 =>(9,4)(1,3) => sqrt[(1–9)²+(3–4)²]=> sqrt[65] =>8.1

For P2,

C1P2 =>(3,3)(2,2) => sqrt[(2–3)²+(2–3)²] => sqrt[2] =>1.4

C2P2 =>(3,7)(2,2)=> sqrt[(2–3)²+(2–7)²] => sqrt[26] =>5.1

C3P2 =>(9,4)(2,2) => sqrt[(2–9)²+(2–4)²]=> sqrt[53] =>7.3

For P3,

C1P3 =>(3,3)(5,8) => sqrt[(5–3)²+(8–3)²] => sqrt[29] =>5.4

C2P3 =>(3,7)(5,8)=> sqrt[(5–3)²+(8–7)²] => sqrt[5] =>2.2

C3P3 =>(9,4)(5,8) => sqrt[(5–9)²+(8–4)²]=> sqrt[32] =>5.7

Similarly for other distances..
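The remaining point-to-center distances follow the same pattern. A short script (using Python's `math.dist` for the Euclidean distance) reproduces the whole distance table and the nearest center for each point:

```python
import math

points = {"P1": (1, 3), "P2": (2, 2), "P3": (5, 8), "P4": (8, 5), "P5": (3, 9),
          "P6": (10, 7), "P7": (3, 3), "P8": (9, 4), "P9": (3, 7)}
centers = {"C1": (3, 3), "C2": (3, 7), "C3": (9, 4)}

for name, p in points.items():
    # distance from this point to each center, rounded to one decimal
    dists = {c: round(math.dist(p, ctr), 1) for c, ctr in centers.items()}
    nearest = min(dists, key=dists.get)
    print(f"{name}: {dists} -> {nearest}")
```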


Cluster 1 => P1(1,3) , P2(2,2) , P7(3,3)

Cluster 2 => P3(5,8) , P5(3,9) , P9(3,7)

Cluster 3 => P4(8,5) , P6(10,7) , P8(9,4)

Now, we re-compute the cluster centers: each new center is the mean of all the points contained in that particular cluster.

New center of Cluster 1 => (1+2+3)/3 , (3+2+3)/3 => (2, 2.7)

New center of Cluster 2 => (5+3+3)/3 , (8+9+7)/3 => (3.7, 8)

New center of Cluster 3 => (8+10+9)/3 , (5+7+4)/3 => (9, 5.3)
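The centroid arithmetic above can be checked with a few lines of Python (the cluster contents are the groups listed above):

```python
def centroid(cluster):
    """Mean of the x-coordinates and mean of the y-coordinates."""
    xs, ys = zip(*cluster)
    return (round(sum(xs) / len(cluster), 1), round(sum(ys) / len(cluster), 1))

cluster1 = [(1, 3), (2, 2), (3, 3)]   # P1, P2, P7
cluster2 = [(5, 8), (3, 9), (3, 7)]   # P3, P5, P9
cluster3 = [(8, 5), (10, 7), (9, 4)]  # P4, P6, P8

print(centroid(cluster1))  # (2.0, 2.7)
print(centroid(cluster2))  # (3.7, 8.0)
print(centroid(cluster3))  # (9.0, 5.3)
```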

Iteration 1 is over. Now let us take the new center points and repeat the same steps: calculate the distance between the data points and the new centers with the Euclidean formula, and form the cluster groups again.

Iteration 2

Calculate the distance between the data points and the new K centers (C1, C2, C3).

C1(2,2.7) , C2(3.7,8) , C3(9,5.3)

C1P1 =>(2,2.7)(1,3) => sqrt[(1–2)²+(3–2.7)²] => sqrt[1.09] =>1.0

C2P1 =>(3.7,8)(1,3)=> sqrt[(1–3.7)²+(3–8)²] => sqrt[32.29] =>5.7

C3P1 =>(9,5.3)(1,3) => sqrt[(1–9)²+(3–5.3)²]=> sqrt[69.29] =>8.3

Similarly for other distances..
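The rest of iteration 2 can be verified the same way: re-assigning every point to its nearest new center, C1(2, 2.7), C2(3.7, 8), C3(9, 5.3), reproduces exactly the same three groups. A quick sketch:

```python
import math

points = {"P1": (1, 3), "P2": (2, 2), "P3": (5, 8), "P4": (8, 5), "P5": (3, 9),
          "P6": (10, 7), "P7": (3, 3), "P8": (9, 4), "P9": (3, 7)}
new_centers = {"C1": (2, 2.7), "C2": (3.7, 8), "C3": (9, 5.3)}

# nearest new center for every point
assignment = {name: min(new_centers, key=lambda c: math.dist(p, new_centers[c]))
              for name, p in points.items()}
print(assignment)
```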


Cluster 1 => P1(1,3) , P2(2,2) , P7(3,3)

Cluster 2 => P3(5,8) , P5(3,9) , P9(3,7)

Cluster 3 => P4(8,5) , P6(10,7) , P8(9,4)

Center of Cluster 1 => (1+2+3)/3 , (3+2+3)/3 => (2, 2.7)

Center of Cluster 2 => (5+3+3)/3 , (8+9+7)/3 => (3.7, 8)

Center of Cluster 3 => (8+10+9)/3 , (5+7+4)/3 => (9, 5.3)

We got the same centroids and cluster groups as in iteration 1, which means the algorithm has converged. Since the clusters are repeating, there is no need to continue iterating, and the last iteration gives the final cluster groups for this dataset.
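As a final sanity check on this clustering, we can compute its WCSS (within-cluster sum of squares): the sum of squared distances from every point to its own cluster center. This is also the quantity the Elbow method plots against K. Using the centers rounded to one decimal as above:

```python
import math

final_clusters = {
    (2.0, 2.7): [(1, 3), (2, 2), (3, 3)],   # Cluster 1
    (3.7, 8.0): [(5, 8), (3, 9), (3, 7)],   # Cluster 2
    (9.0, 5.3): [(8, 5), (10, 7), (9, 4)],  # Cluster 3
}
wcss = sum(math.dist(p, center) ** 2
           for center, pts in final_clusters.items() for p in pts)
print(round(wcss, 2))  # 14.01
```

For larger datasets, one would compute the WCSS for several values of K and pick the "elbow" where the curve stops dropping sharply.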

The graph below shows the difference between iterations 1 and 2: we can see that the centroids (green dots) moved in the second iteration.

[Figure: cluster assignments and centroids after iterations 1 and 2]

Conclusion

I hope the math steps behind K-Means Clustering are clear now. In this blog we used a small dataset, so we set the K value to 3 and needed only 2 iterations. Real-world datasets have many more features and points; in that case we should use the Elbow method (based on WCSS) to find a good K value (number of cluster groups). Check out some of my previous blogs:

  • Implementation of K-Means Clustering (medium.com)
  • Linear Regression - LSM (medium.com)
  • Implementation of Linear Regression using Python (medium.com)

Have doubts? Need help? Contact me!

LinkedIn: https://www.linkedin.com/in/dharmaraj-d-1b707898

Github: https://github.com/DharmarajPi
