In this post, I’d like to discuss one of the basic unsupervised machine learning algorithms, k-means clustering.
The basic idea of k-means clustering is to find centroid positions that minimize the variance within each cluster. In the real world, the algorithm is widely used for tasks such as document clustering, recommendation engines, image segmentation, and customer segmentation.
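Formally, if the points assigned to cluster C_i have centroid mu_i, k-means tries to minimize the within-cluster sum of squared distances:

$$ J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 $$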
K-means clustering algorithm steps:
1. Generate k initial centroids by selecting randomly from the data set
2. Assign each data point a cluster label by finding the centroid closest to that point
3. Calculate new centroids by taking the mean (the first moment) of the data points within each cluster
4. Repeat steps 2 and 3 until the centroids converge
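Before the full implementation below, here is a minimal vectorized sketch of the assignment and update steps (steps 2 and 3), assuming data is a NumPy (n, 2) array, centroids is a (k, 2) array of current centroids, and every cluster stays non-empty (the helper name one_kmeans_step is just for illustration):

import numpy as np

def one_kmeans_step(data, centroids):
    # step 2: squared distance from every point to every centroid,
    # then label each point with the index of its nearest centroid
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # step 3: each new centroid is the mean of the points assigned to it
    new_centroids = np.array([data[labels == i].mean(axis=0)
                              for i in range(len(centroids))])
    return new_centroids, labels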
Implementation
To run the Python code below, we first need to import a few libraries.
from random import random
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
Generating Clusters
To generate k clusters, I start with k random center points within xmin < x < xmax and ymin < y < ymax.
Then I generate n data points randomly within radius r of each center, using a random radius and a random angle (polar coordinates).
# cluster generator: centers fall within (-xlen, -ylen) < (x, y) < (xlen, ylen)
# k: the number of clusters
# r: radius of each cluster
# n: the number of data points within each cluster
def cluster_gen(k, r, n, xlen, ylen):
    data = []
    # create the [x|y] coordinates of the cluster centers
    kx = np.random.rand(k, 1)*xlen*2 - xlen
    ky = np.random.rand(k, 1)*ylen*2 - ylen
    kcent = np.append(kx, ky, 1)
    # generate n data points around each center within radius r,
    # using a random radius and a random angle
    for cluster in range(k):
        for points in range(n):
            rad = r*random()
            theta = 2*np.pi*random()
            x = kcent[cluster, 0] + rad*math.cos(theta)
            y = kcent[cluster, 1] + rad*math.sin(theta)
            data.append([x, y])
    return np.array(data)
Test cluster generator
Let’s test the cluster generator. We define k=3 clusters, each with n=100 points within a radius of r=40, and the cluster centers fall within the boundary (-70 < x < 70) and (-70 < y < 70).
# test the cluster generator
data = cluster_gen(k=3, r=40, n=100, xlen=70, ylen=70)
plt.scatter(data[:, 0], data[:, 1])
plt.show()
K-means clustering algorithm
The implementation below follows the steps above: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop moving.
# squared Euclidean distance between two points
# (the squared distance is enough for nearest-centroid comparisons)
def dist(a, b):
    return np.sum((a - b)**2)

# find the x, y coordinates of the center of mass of a cluster
def cent_mass(data):
    l = len(data)
    x = sum(row[0] for row in data)
    y = sum(row[1] for row in data)
    return [x/l, y/l]

# assign initial k centroids by randomly selecting k samples from the data
def init_centroid(k, data):
    return data[np.random.randint(data.shape[0], size=k), :]

def k_means(k, data):
    k_init = init_centroid(k, data)
    breaker = True
    while breaker:
        # label each data point with the index of its nearest centroid
        cluster = []
        for i in range(len(data)):
            ksort = []
            cur_p = data[i]
            for j in range(k):
                k_cent = k_init[j]
                ksort.append(dist(cur_p, k_cent))
            ind_min = np.argmin(ksort)
            cluster.append(ind_min)
        cluster = np.array(cluster)
        cluster = cluster.reshape(len(cluster), 1)
        # find the new coordinates of each centroid
        df = pd.DataFrame(cluster, columns=['k'])
        k_new = []
        for i in range(k):
            k_index = df.index[df['k'] == i].tolist()
            data_in_kgroup = data[k_index]
            # include the current centroid itself so an empty cluster
            # keeps its previous position instead of producing NaN
            data_in_kgroup = np.append(data_in_kgroup, [k_init[i]], 0)
            new_cm = cent_mass(data_in_kgroup)
            k_new.append(new_cm)
        k_init = np.array(k_init)
        k_new = np.array(k_new)
        # stop when no centroid moved between iterations
        if np.array_equal(k_init, k_new):
            breaker = False
        else:
            k_init = k_new
    return k_new, cluster
Test K-means clustering algorithm
We test the k-means clustering algorithm with k = 3. Each cluster is visualized with a different color and each centroid is represented as a red dot.
k_cord, k_data = k_means(k=3, data=data)
fig, ax = plt.subplots()
# plot each cluster with a different color
scatter = ax.scatter(data[:, 0], data[:, 1], c=k_data.ravel())
legend = ax.legend(*scatter.legend_elements(), loc="upper left", title="Clusters")
ax.add_artist(legend)
# plot centroids
cent = plt.scatter(k_cord[:, 0], k_cord[:, 1], c="red", label="Centroid")
plt.legend(loc='lower right')
plt.show()
The result shows that the algorithm successfully captures the 3 clusters.
However, the k-means clustering algorithm can be affected by the choice of initial centroids when the clusters are dispersed. For example, if two randomly selected initial centroids happen to fall within the same cluster, they can both converge inside that cluster. Here are example plots.
The plots show that the two centroids on the right side converge within the same cluster, while the centroid on the left side ends up covering two clusters. One way of resolving this problem is to make sure the initial centroids start further apart. This is the idea behind the k-means++ algorithm.
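A quick way to observe this sensitivity is to run k_means from the implementation above a few times and compare the final centroids between runs:

# run k-means several times; each run draws new random initial centroids,
# so runs that start badly settle on different final centroids
for run in range(5):
    centroids, labels = k_means(k=3, data=data)
    print("run", run, "centroids:")
    print(np.round(centroids, 2))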
K-means++ algorithm
Now, instead of selecting all initial centroids randomly from the data set, we pick one centroid at random and then select the remaining centroids based on maximum distance. (Strictly speaking, the canonical k-means++ samples each new centroid with probability proportional to its squared distance from the nearest existing centroid; here we use the simpler, deterministic farthest-point variant.) Here are the steps:
1. Pick one initial centroid randomly from the data set
2. Compute the distance of each data point from that centroid and assign the point with the largest distance as the next centroid
3. Compute the distance of each data point to its closest centroid
4. Assign the point with the maximum such distance as the next centroid
5. Repeat steps 3 and 4 until k centroids are selected
Here is the k-means++ code in Python.
# select initial k centroids by repeatedly adding the point farthest
# from all centroids chosen so far (farthest-point initialization)
def init_centroid_better(k, data):
    k_init = []
    k_init.append(data[np.random.randint(data.shape[0]), :])
    for i in range(k - 1):
        distance = []
        for j in range(data.shape[0]):
            cur_data = data[j, :]
            dist_temp = []
            # distance from this point to each centroid chosen so far
            for m in range(len(k_init)):
                dist_centroid = dist(cur_data, k_init[m])
                dist_temp.append(dist_centroid)
            d_min = min(dist_temp)
            distance.append(d_min)
        # the next centroid is the point farthest from its nearest centroid
        ind_cent = np.argmax(distance)
        k_init.append(data[ind_cent])
    return np.array(k_init)
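For reference, the canonical k-means++ initialization is randomized: each new centroid is sampled with probability proportional to the squared distance to its nearest existing centroid (D² sampling), rather than deterministically taking the farthest point. A minimal sketch, reusing the dist function above (the name init_centroid_pp is just for illustration):

# canonical k-means++ initialization (D^2 sampling)
def init_centroid_pp(k, data):
    k_init = [data[np.random.randint(data.shape[0])]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.array([min(dist(p, c) for c in k_init) for p in data])
        # sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        ind = np.random.choice(data.shape[0], p=probs)
        k_init.append(data[ind])
    return np.array(k_init)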
Now we can replace the random initialization in the k-means clustering algorithm with the better initialization.
def k_means(k, data):
    #### k-means++ style initialization
    k_init = init_centroid_better(k, data)
    breaker = True
    while breaker:
        # label each data point with the index of its nearest centroid
        cluster = []
        for i in range(len(data)):
            ksort = []
            cur_p = data[i]
            for j in range(k):
                ...
# the rest of the code is the same as before
First, let’s look at the initial centroids chosen by this initialization. This time we use k=4: the 4 initial centroids (red) are selected such that each centroid is far from the others.
As a result, the k-means clustering algorithm successfully captures all 4 clusters.
Finally, we can make the result more robust by running the whole procedure several times and keeping the run with the lowest cost, where the cost is the average distance of each cluster's points to their centroid, summed over the clusters. The function below runs k_means num_iter times and returns the centroids and labels from the lowest-cost run.
# add restarts to the k-means clustering algorithm and keep the best run
# num_iter: the number of restarts
def k_means_adapt(k, data, num_iter):
    kx = np.zeros([k, num_iter])
    ky = np.zeros([k, num_iter])
    kcluster = np.zeros([len(data), num_iter])
    cost = np.zeros([1, num_iter])
    k_coord = np.zeros([k, 2])
    for i in range(num_iter):
        cost_temp = np.zeros([k, 1])
        k_new, cluster = k_means(k, data)
        kcluster[:, i] = cluster.ravel()
        kx[:, i] = k_new[:, 0]
        ky[:, i] = k_new[:, 1]
        df = pd.DataFrame(cluster, columns=['k'])
        for j in range(k):
            index = df.index[df['k'] == j].tolist()
            kdata = data[index]
            l = len(kdata)
            # average distance of the cluster's points to this run's centroid
            dist_cost = np.sum(np.sqrt((kdata[:, 0] - kx[j, i])**2 + (kdata[:, 1] - ky[j, i])**2))/l
            cost_temp[j, 0] = dist_cost
        cost[0, i] = np.sum(cost_temp)
    # keep the centroids and labels from the lowest-cost run
    index_min = np.argmin(cost)
    k_coord[:, 0] = kx[:, index_min]
    k_coord[:, 1] = ky[:, index_min]
    data_label = kcluster[:, index_min]
    return k_coord, data_label
Test K-means clustering algorithm