Machine Learning - Lecture 3: k-means clustering
Chris Thornton
Introduction
Nearest-neighbour methods are a simple way of using
clump structure for prediction.
They avoid the need to construct a model.
Unfortunately, they have a serious weakness.
As the number of datapoints increases, the task of finding
nearest neighbours gets harder.
Eventually, the time (or storage) costs involved may become
unacceptable.
At this point, we have no choice but to revert to an
approach based on explicit modelling.
Clustering
The model we obtain in a particular case depends on the
patterns we go looking for, and on the way in which we then
represent those patterns.
Methods which aim to find and represent basic clumps in the
data are known as clustering methods.
An enormous number of these have been devised, many in the
field of statistics.
We will focus on just two of the best-known methods.
Agglomerative clustering
This method builds a tree structure representing the way in
which datapoints clump together in a hierarchical way.
There are four main steps:
1. Initialise clusters so that each datapoint has its own
   cluster.
2. Calculate the similarity of every pair of clusters by
   averaging the similarities of their datapoints.
3. Form a new cluster by combining the two most similar
   clusters.
4. Exit if just one cluster is left. Otherwise repeat from
   step 2.
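To make these steps concrete, here is a minimal sketch in
Python, assuming similarity is measured by Euclidean distance
(so the most similar pair of clusters is the pair with the
smallest average pairwise distance). The function names are
illustrative, not from any standard library.

  from itertools import combinations

  def distance(p, q):
      # Euclidean distance between two datapoints.
      return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

  def cluster_distance(c1, c2):
      # Average pairwise distance between the datapoints of two clusters.
      pairs = [(p, q) for p in c1 for q in c2]
      return sum(distance(p, q) for p, q in pairs) / len(pairs)

  def agglomerate(points):
      # Step 1: initialise so each datapoint has its own cluster.
      clusters = [[p] for p in points]
      merges = []
      while len(clusters) > 1:
          # Step 2: score every pair of clusters by average distance.
          i, j = min(combinations(range(len(clusters)), 2),
                     key=lambda ij: cluster_distance(clusters[ij[0]],
                                                     clusters[ij[1]]))
          # Step 3: combine the two most similar clusters.
          merges.append((clusters[i], clusters[j]))
          merged = clusters[i] + clusters[j]
          clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
          clusters.append(merged)
      # Step 4: exit when just one cluster is left.
      return merges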
Example
Let's say we have a dataset based on two variables: VL and
TD.
VL  TD
17   1
18   6
29   0
16   3
17   4
22   7
15  16
16   0
19   2
14   3
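As a hypothetical run, the sketch above can be applied to
these datapoints directly (this reuses the agglomerate()
function defined earlier):

  # The VL/TD dataset as (VL, TD) tuples.
  data = [(17, 1), (18, 6), (29, 0), (16, 3), (17, 4),
          (22, 7), (15, 16), (16, 0), (19, 2), (14, 3)]

  # Print each merge, from first (most similar) to last.
  for a, b in agglomerate(data):
      print(a, '+', b)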
Visualisation
With low-dimensional data we have the great advantage of
being able to look at the data as a distribution of points.
29  x
28
27
26
25
24
23
22                       x
21
20
19        x
18                    x
17     x        x
16  x        x
15                                                  x
14           x
    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Hierarchy produced by agglomerative clustering
|-------------------------- <15 16>
|--
    |---------------------- <29 0>
    |--
        |------------------ <22 7>
        |--
            |-------------- <18 6>
            |--
                |--
                |   |------ <14 3>
                |   |--
                |       |-- <16 3>
                |       |-- <17 4>
                |--
                    |------ <19 2>
                    |--
                        |-- <17 1>
                        |-- <16 0>
Non-hierarchical clustering
While cluster hierarchies provide interesting information
about groupings at different levels of description, what we
really want for prediction is a method that divides the data
into a single set of groups.
This effect can be obtained using k-means clustering.
This is the most widely used non-hierarchical clustering
method.
k-means clustering algorithm
Let's say the data fall into k clumps, and
we want to know where these clumps are.
For this method, we use imaginary datapoints called
means or centroids to represent clump centres.
Having initialised exactly k centroids to random positions,
we then apply the following two steps:
- Each datapoint is assigned to its closest centroid.
- The position of each centroid is set to be the average
of all the datapoints assigned to it.
These two steps are then repeated as long as we see any
change in the assignments.
The general effect is that the centroids `move' rapidly to
the k most densely populated parts of the dataspace.
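Here is a minimal sketch of the algorithm in Python. It
assumes Euclidean distance, and uses the common variant of
initialising the centroids at k randomly chosen datapoints
rather than at arbitrary random positions.

  import random
  from math import dist  # Euclidean distance between two points

  def closest(point, centroids):
      # Index of the centroid closest to the given datapoint.
      return min(range(len(centroids)),
                 key=lambda i: dist(point, centroids[i]))

  def k_means(points, k):
      # Initialise exactly k centroids.
      centroids = random.sample(points, k)
      assignment = None
      while True:
          # Step 1: assign each datapoint to its closest centroid.
          new_assignment = [closest(p, centroids) for p in points]
          if new_assignment == assignment:
              return centroids, assignment  # assignments unchanged: stop
          assignment = new_assignment
          # Step 2: move each centroid to the average of its datapoints.
          for i in range(k):
              members = [p for p, a in zip(points, assignment) if a == i]
              if members:  # a centroid that captured nothing stays put
                  centroids[i] = tuple(sum(xs) / len(members)
                                       for xs in zip(*members))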
Using centroids for prediction
Models based on centroids can be used for prediction
by applying the nearest-neighbour approach.
For each centroid, we work out which classification is most
common among its captured datapoints.
That classification is then associated with the centroid.
Then we just apply the nearest-neighbour rule, but using
centroids rather than datapoints.
Any new case is predicted to have the classification
associated with its nearest centroid.
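A sketch of this prediction step, reusing the closest() and
k_means() functions from the sketch above (the helper names
are again illustrative):

  from collections import Counter

  def label_centroids(labels, assignment, k):
      # Associate each centroid with the most common classification
      # among its captured datapoints (None if it captured nothing).
      centroid_labels = []
      for i in range(k):
          captured = [l for l, a in zip(labels, assignment) if a == i]
          centroid_labels.append(
              Counter(captured).most_common(1)[0][0] if captured else None)
      return centroid_labels

  def predict(point, centroids, centroid_labels):
      # A new case gets the classification of its nearest centroid.
      return centroid_labels[closest(point, centroids)]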
Demo
- Creating data: 1s, 0s, Xs
- Listing data
- What happens with a single centroid
- What happens with more than one centroid
- How do we find the right number of centroids?
- Colour coding for predictions
- Use of unseen datapoints
Summary
- As dataset size increases, nearest-neighbour methods get
increasingly costly.
- Eventually, the only alternative is to produce some form
of explicit model.
- Clustering methods aim to model the way data `clump'
together.
- Agglomerative clustering produces a hierarchical model.
- k-means clustering is a simple and efficient way of
deriving a non-hierarchical model.
- Applying the nearest-neighbour rule to cluster centroids
can be a way of generating predictions and classifications.
Questions
- In k-means clustering, why do centroids appear to `repel'
each other?
- Estimate the amount of memory required in executing the
k-means clustering algorithm.
- On what assumptions can we say that
k-means clustering performs induction?
- In a random distribution of positive and negative
datapoints, estimate the number of centroids needed
to accurately model the data.
- What sort of clump are we expecting when we use
measurements of city-block distance for clustering?