K-Means

Map > Data Science > Predicting the Future > Modeling > Clustering > K-Means

K-Means Clustering

K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k leading to the greatest separation (distance) is not known as a priori and must be computed from the data. The objective of K-Means clustering is to minimize total intra-cluster variance, or, the squared error function:

Algorithm

Clusters the data into k groups where k is predefined.
Select k points at random as cluster centers.
Assign objects to their closest cluster center according to the Euclidean distance function.
Calculate the centroid or mean of all objects in each cluster.
Repeat steps 2, 3 and 4 until the same points are assigned to each cluster in consecutive rounds.

K-Means is relatively an efficient method. However, we need to specify the number of clusters, in advance and the final results are sensitive to initialization and often terminates at a local optimum. Unfortunately there is no global theoretical method to find the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting.

Example:

Suppose we want to group the visitors to a website using just their age (one-dimensional space) as follows:

n = 19

15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65

Initial clusters (random centroid or average):

k = 2

c₁ = 16
c₂ = 22

Iteration 1:

c₁ = 15.33
c₂ = 36.25

x_i	c₁	c₂	Distance 1	Distance 2	Nearest Cluster	New Centroid
15	16	22	1	7	1	15.33
15	16	22	1	7	1
16	16	22	0	6	1
19	16	22	3	3	2	36.25
19	16	22	3	3	2
20	16	22	4	2	2
20	16	22	4	2	2
21	16	22	5	1	2
22	16	22	6	0	2
28	16	22	12	6	2
35	16	22	19	13	2
40	16	22	24	18	2
41	16	22	25	19	2
42	16	22	26	20	2
43	16	22	27	21	2
44	16	22	28	22	2
60	16	22	44	38	2
61	16	22	45	39	2
65	16	22	49	43	2

Iteration 2:

c₁ = 18.56
c₂ = 45.90

x_i	c₁	c₂	Distance 1	Distance 2	Nearest Cluster	New Centroid
15	15.33	36.25	0.33	21.25	1	18.56
15	15.33	36.25	0.33	21.25	1
16	15.33	36.25	0.67	20.25	1
19	15.33	36.25	3.67	17.25	1
19	15.33	36.25	3.67	17.25	1
20	15.33	36.25	4.67	16.25	1
20	15.33	36.25	4.67	16.25	1
21	15.33	36.25	5.67	15.25	1
22	15.33	36.25	6.67	14.25	1
28	15.33	36.25	12.67	8.25	2	45.9
35	15.33	36.25	19.67	1.25	2
40	15.33	36.25	24.67	3.75	2
41	15.33	36.25	25.67	4.75	2
42	15.33	36.25	26.67	5.75	2
43	15.33	36.25	27.67	6.75	2
44	15.33	36.25	28.67	7.75	2
60	15.33	36.25	44.67	23.75	2
61	15.33	36.25	45.67	24.75	2
65	15.33	36.25	49.67	28.75	2

Iteration 3:

c₁ = 19.50
c₂ = 47.89

x_i	c₁	c₂	Distance 1	Distance 2	Nearest Cluster	New Centroid
15	18.56	45.9	3.56	30.9	1	19.50
15	18.56	45.9	3.56	30.9	1
16	18.56	45.9	2.56	29.9	1
19	18.56	45.9	0.44	26.9	1
19	18.56	45.9	0.44	26.9	1
20	18.56	45.9	1.44	25.9	1
20	18.56	45.9	1.44	25.9	1
21	18.56	45.9	2.44	24.9	1
22	18.56	45.9	3.44	23.9	1
28	18.56	45.9	9.44	17.9	1
35	18.56	45.9	16.44	10.9	2	47.89
40	18.56	45.9	21.44	5.9	2
41	18.56	45.9	22.44	4.9	2
42	18.56	45.9	23.44	3.9	2
43	18.56	45.9	24.44	2.9	2
44	18.56	45.9	25.44	1.9	2
60	18.56	45.9	41.44	14.1	2
61	18.56	45.9	42.44	15.1	2
65	18.56	45.9	46.44	19.1	2

Iteration 4:

c₁ = 19.50
c₂ = 47.89

x_i	c₁	c₂	Distance 1	Distance 2	Nearest Cluster	New Centroid
15	19.5	47.89	4.50	32.89	1	19.50
15	19.5	47.89	4.50	32.89	1
16	19.5	47.89	3.50	31.89	1
19	19.5	47.89	0.50	28.89	1
19	19.5	47.89	0.50	28.89	1
20	19.5	47.89	0.50	27.89	1
20	19.5	47.89	0.50	27.89	1
21	19.5	47.89	1.50	26.89	1
22	19.5	47.89	2.50	25.89	1
28	19.5	47.89	8.50	19.89	1
35	19.5	47.89	15.50	12.89	2	47.89
40	19.5	47.89	20.50	7.89	2
41	19.5	47.89	21.50	6.89	2
42	19.5	47.89	22.50	5.89	2
43	19.5	47.89	23.50	4.89	2
44	19.5	47.89	24.50	3.89	2
60	19.5	47.89	40.50	12.11	2
61	19.5	47.89	41.50	13.11	2
65	19.5	47.89	45.50	17.11	2

No change between iterations 3 and 4 has been noted. By using clustering, 2 groups have been identified 15-28 and 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often run multiple times with different starting conditions in order to get a fair view of what the clusters should be.

Exercise