In this post, I’d like to discuss one of the basic unsupervised machine learning algorithms, k-means clustering.
The basic idea of k-means clustering is to find centroid positions that minimize the variance within each cluster. In the real world, the algorithm is widely used for tasks such as document clustering, recommendation engines, image segmentation, and customer segmentation.
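Formally, if the points assigned to cluster C_i have centroid mu_i, k-means tries to minimize the within-cluster sum of squared distances:

$$ J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 $$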
K-means clustering algorithm steps:
1. Generate k initial centroids by selecting randomly from the data set
2. Assign each data point a cluster label by finding the centroid closest to that point
3. Calculate new centroids by taking the mean (the first moment) of the data points within each cluster
4. Repeat steps 2 and 3 until the centroids converge
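Before the full implementation below, here is a minimal vectorized sketch of the assignment and update steps (steps 2 and 3), assuming data is a NumPy (n, 2) array, centroids is a (k, 2) array of current centroids, and every cluster stays non-empty (the helper name one_kmeans_step is just for illustration):

import numpy as np

def one_kmeans_step(data, centroids):
    # step 2: squared distance from every point to every centroid,
    # then label each point with the index of its nearest centroid
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # step 3: each new centroid is the mean of the points assigned to it
    new_centroids = np.array([data[labels == i].mean(axis=0)
                              for i in range(len(centroids))])
    return new_centroids, labels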
Implementation
To run the Python code below, we first need to import a few libraries.
from random import random
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
Generating Clusters
To generate k clusters, I start with k random center points within xmin < x < xmax and ymin < y < ymax.
Then I generate n data points randomly within radius r of each center, using a random radius and a random angle (polar coordinates).
# cluster generator: centers fall within (-xlen, -ylen) < (x, y) < (xlen, ylen)
# k: the number of clusters
# r: radius of each cluster
# n: the number of data points within each cluster
def cluster_gen(k, r, n, xlen, ylen):
    data = []
    # create the [x|y] coordinates of the cluster centers
    kx = np.random.rand(k, 1)*xlen*2 - xlen
    ky = np.random.rand(k, 1)*ylen*2 - ylen
    kcent = np.append(kx, ky, 1)
    # generate n data points around each center within radius r,
    # using a random radius and a random angle
    for cluster in range(k):
        for points in range(n):
            rad = r*random()
            theta = 2*np.pi*random()
            x = kcent[cluster, 0] + rad*math.cos(theta)
            y = kcent[cluster, 1] + rad*math.sin(theta)
            data.append([x, y])
    return np.array(data)
Test cluster generator
Let’s test the cluster generator. We define k=3 clusters, each with n=100 points within a radius of r=40, and the cluster centers fall within the boundary (-70 < x < 70) and (-70 < y < 70).
# test the cluster generator
data = cluster_gen(k=3, r=40, n=100, xlen=70, ylen=70)
plt.scatter(data[:, 0], data[:, 1])
plt.show()
K-means clustering algorithm
The implementation below follows the steps above: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop moving.
# squared Euclidean distance between two points
# (the squared distance is enough for nearest-centroid comparisons)
def dist(a, b):
    return np.sum((a - b)**2)

# find the x, y coordinates of the center of mass of a cluster
def cent_mass(data):
    l = len(data)
    x = sum(row[0] for row in data)
    y = sum(row[1] for row in data)
    return [x/l, y/l]

# assign initial k centroids by randomly selecting k samples from the data
def init_centroid(k, data):
    return data[np.random.randint(data.shape[0], size=k), :]

def k_means(k, data):
    k_init = init_centroid(k, data)
    breaker = True
    while breaker:
        # label each data point with the index of its nearest centroid
        cluster = []
        for i in range(len(data)):
            ksort = []
            cur_p = data[i]
            for j in range(k):
                k_cent = k_init[j]
                ksort.append(dist(cur_p, k_cent))
            ind_min = np.argmin(ksort)
            cluster.append(ind_min)
        cluster = np.array(cluster)
        cluster = cluster.reshape(len(cluster), 1)
        # find the new coordinates of each centroid
        df = pd.DataFrame(cluster, columns=['k'])
        k_new = []
        for i in range(k):
            k_index = df.index[df['k'] == i].tolist()
            data_in_kgroup = data[k_index]
            # include the current centroid itself so an empty cluster
            # keeps its previous position instead of producing NaN
            data_in_kgroup = np.append(data_in_kgroup, [k_init[i]], 0)
            new_cm = cent_mass(data_in_kgroup)
            k_new.append(new_cm)
        k_init = np.array(k_init)
        k_new = np.array(k_new)
        # stop when no centroid moved between iterations
        if np.array_equal(k_init, k_new):
            breaker = False
        else:
            k_init = k_new
    return k_new, cluster
Test K-means clustering algorithm
We test the k-means clustering algorithm with k = 3. Each cluster is visualized with a different color and each centroid is represented as a red dot.
k_cord, k_data = k_means(k=3, data=data)
fig, ax = plt.subplots()
# plot each cluster with a different color
scatter = ax.scatter(data[:, 0], data[:, 1], c=k_data.ravel())
legend = ax.legend(*scatter.legend_elements(), loc="upper left", title="Clusters")
ax.add_artist(legend)
# plot centroids
cent = plt.scatter(k_cord[:, 0], k_cord[:, 1], c="red", label="Centroid")
plt.legend(loc='lower right')
plt.show()
The result shows that the algorithm successfully captures the 3 clusters.
However, the k-means clustering algorithm can be affected by the choice of initial centroids when the clusters are dispersed. For example, if two randomly selected initial centroids happen to fall within the same cluster, they can both converge inside that cluster. Here are example plots.
The plots show that the two centroids on the right side converge within the same cluster, while the centroid on the left side ends up covering two clusters. One way of resolving this problem is to make sure the initial centroids start further apart. This is the idea behind the k-means++ algorithm.
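A quick way to observe this sensitivity is to run k_means from the implementation above a few times and compare the final centroids between runs:

# run k-means several times; each run draws new random initial centroids,
# so runs that start badly settle on different final centroids
for run in range(5):
    centroids, labels = k_means(k=3, data=data)
    print("run", run, "centroids:")
    print(np.round(centroids, 2))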
K-means++ algorithm
Now, instead of selecting all initial centroids randomly from the data set, we pick one centroid at random and then select the remaining centroids based on maximum distance. (Strictly speaking, the canonical k-means++ samples each new centroid with probability proportional to its squared distance from the nearest existing centroid; here we use the simpler, deterministic farthest-point variant.) Here are the steps:
1. Pick one initial centroid randomly from the data set
2. Compute the distance of each data point from that centroid and assign the point with the largest distance as the next centroid
3. Compute the distance of each data point to its closest centroid
4. Assign the point with the maximum such distance as the next centroid
5. Repeat steps 3 and 4 until k centroids are selected
Here is the k-means++ code in Python.
# select initial k centroids by repeatedly adding the point farthest
# from all centroids chosen so far (farthest-point initialization)
def init_centroid_better(k, data):
    k_init = []
    k_init.append(data[np.random.randint(data.shape[0]), :])
    for i in range(k - 1):
        distance = []
        for j in range(data.shape[0]):
            cur_data = data[j, :]
            dist_temp = []
            # distance from this point to each centroid chosen so far
            for m in range(len(k_init)):
                dist_centroid = dist(cur_data, k_init[m])
                dist_temp.append(dist_centroid)
            d_min = min(dist_temp)
            distance.append(d_min)
        # the next centroid is the point farthest from its nearest centroid
        ind_cent = np.argmax(distance)
        k_init.append(data[ind_cent])
    return np.array(k_init)
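For reference, the canonical k-means++ initialization is randomized: each new centroid is sampled with probability proportional to the squared distance to its nearest existing centroid (D² sampling), rather than deterministically taking the farthest point. A minimal sketch, reusing the dist function above (the name init_centroid_pp is just for illustration):

# canonical k-means++ initialization (D^2 sampling)
def init_centroid_pp(k, data):
    k_init = [data[np.random.randint(data.shape[0])]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.array([min(dist(p, c) for c in k_init) for p in data])
        # sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        ind = np.random.choice(data.shape[0], p=probs)
        k_init.append(data[ind])
    return np.array(k_init)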
Now we can replace the random initialization in the k-means clustering algorithm with the better initialization.
def k_means(k, data):
    #### k-means++ style initialization
    k_init = init_centroid_better(k, data)
    breaker = True
    while breaker:
        # label each data point with the index of its nearest centroid
        cluster = []
        for i in range(len(data)):
            ksort = []
            cur_p = data[i]
            for j in range(k):
                ...
# the rest of the code is the same as before
First, let’s look at the initial centroids chosen by this initialization. This time we use k=4: the 4 initial centroids (red) are selected such that each centroid is far from the others.
As a result, the k-means clustering algorithm successfully captures all 4 clusters.
Finally, we can make the result more robust by running the whole procedure several times and keeping the run with the lowest cost, where the cost is the average distance of each cluster's points to their centroid, summed over the clusters. The function below runs k_means num_iter times and returns the centroids and labels from the lowest-cost run.
# add restarts to the k-means clustering algorithm and keep the best run
# num_iter: the number of restarts
def k_means_adapt(k, data, num_iter):
    kx = np.zeros([k, num_iter])
    ky = np.zeros([k, num_iter])
    kcluster = np.zeros([len(data), num_iter])
    cost = np.zeros([1, num_iter])
    k_coord = np.zeros([k, 2])
    for i in range(num_iter):
        cost_temp = np.zeros([k, 1])
        k_new, cluster = k_means(k, data)
        kcluster[:, i] = cluster.ravel()
        kx[:, i] = k_new[:, 0]
        ky[:, i] = k_new[:, 1]
        df = pd.DataFrame(cluster, columns=['k'])
        for j in range(k):
            index = df.index[df['k'] == j].tolist()
            kdata = data[index]
            l = len(kdata)
            # average distance of the cluster's points to this run's centroid
            dist_cost = np.sum(np.sqrt((kdata[:, 0] - kx[j, i])**2 + (kdata[:, 1] - ky[j, i])**2))/l
            cost_temp[j, 0] = dist_cost
        cost[0, i] = np.sum(cost_temp)
    # keep the centroids and labels from the lowest-cost run
    index_min = np.argmin(cost)
    k_coord[:, 0] = kx[:, index_min]
    k_coord[:, 1] = ky[:, index_min]
    data_label = kcluster[:, index_min]
    return k_coord, data_label
Test K-means clustering algorithm