K-Means clustering is an unsupervised machine learning algorithm that partitions a dataset into \(k\) distinct clusters by minimizing the within-cluster variance, that is, the sum of squared distances between each point and the centroid of its cluster. It is commonly used for pattern recognition, data compression, and finding structure in data.

The key idea behind K-Means is to divide data points into clusters such that data points in the same cluster are similar to each other, while data points in different clusters are dissimilar.

The K-Means algorithm iteratively updates cluster centroids and assigns data points to the nearest centroid based on the Euclidean distance between the points and the centroids.
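
Written as a formula, the quantity K-Means tries to make small is the total within-cluster sum of squares. The notation is introduced here for illustration and is not part of the original text: \(C_j\) is the set of points assigned to cluster \(j\), \(\mu_j\) is the centroid of that cluster, and \(\lVert \cdot \rVert\) is the Euclidean norm.

```latex
\[
W = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```

In R, this total is the value that kmeans() returns as tot.withinss.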

Steps of the K-Means Algorithm:

  1. Choose the number of clusters \(k\).

  2. Initialize: Randomly select \(k\) initial centroids.

  3. Assign each data point to the nearest centroid based on distance (typically Euclidean distance).

  4. Update centroids by calculating the mean of all points assigned to each cluster.

  5. Repeat steps 3 and 4 until the centroids do not change significantly (convergence).
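
The five steps above can be sketched directly in R. This is a minimal illustration of the iteration as described (often called Lloyd's algorithm), not the implementation behind R's kmeans(), which uses the Hartigan-Wong variant by default. The function name simple_kmeans and its arguments max_iter and tol are hypothetical names chosen for this sketch.

```r
# Minimal sketch of the five steps (Lloyd's algorithm), for illustration only.
# simple_kmeans is a hypothetical helper, not part of base R.
simple_kmeans <- function(points, k, max_iter = 100, tol = 1e-8) {
  points <- as.matrix(points)

  # Step 2: pick k distinct rows at random as the initial centroids
  centroids <- points[sample(nrow(points), k), , drop = FALSE]

  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every point to every centroid
    # (one column per centroid), then assign each point to the nearest one
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(points, 2, centroids[j, ])^2)
    })
    assignment <- max.col(-d2, ties.method = "first")

    # Step 4: move each centroid to the mean of the points assigned to it
    # (a centroid with no points is left where it is)
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- points[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }

    # Step 5: stop once the centroids have essentially stopped moving
    shift <- max(abs(new_centroids - centroids))
    centroids <- new_centroids
    if (shift < tol) break
  }

  list(cluster = assignment, centers = centroids, iterations = iter)
}

# Usage on two well-separated groups of 20 points each
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.1), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.1), ncol = 2))
res <- simple_kmeans(pts, k = 2)
table(res$cluster)
```

Because the starting centroids are random, the cluster labels (1 or 2) can swap between runs; only the grouping itself is meaningful.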

Advantages:

Disadvantages:

Applications:


K-Means Clustering Example in R

We will use K-Means to cluster data points based on two variables. We’ll generate a dataset of random points and use the K-Means algorithm to identify clusters.

Step 1: Create the Data

We’ll generate a dataset with two features (x and y) that can be grouped into clusters.

# Set the random seed so the example is reproducible
set.seed(123)

# Generate random data points
x <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
y <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))

# Combine into a data frame
data <- data.frame(x, y)
head(data)

Here, we generate 150 data points distributed around three centers: (1, 1), (5, 5), and (9, 9).
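
As a quick sanity check, the per-group means of the generated data can be compared with the intended centers. The sketch below regenerates the same data so that it runs on its own; true_group is a helper vector introduced here for illustration and relies on the points being stored in generation order (50 per center).

```r
# Regenerate the data exactly as in Step 1
set.seed(123)
x <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
y <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
data <- data.frame(x, y)

# Label each point with the group it was generated from (1, 2 or 3)
true_group <- rep(1:3, each = 50)

# Mean of x and y within each generated group
group_means <- aggregate(data, by = list(group = true_group), FUN = mean)
print(group_means)
```

The three rows should lie close to (1, 1), (5, 5), and (9, 9), with small deviations due to sampling noise.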

Step 2: Perform K-Means Clustering

We will use the kmeans() function in R to apply K-Means clustering to our dataset. We need to specify the number of clusters \(k\), which we’ll set to 3.

# Apply K-Means clustering with 3 clusters
kmeans_result <- kmeans(data, centers = 3)

# Print the results
print(kmeans_result)

Output:

K-means clustering with 3 clusters of sizes 50, 50, 50

Cluster means:
         x        y
1 5.073204 4.995740
2 8.873050 9.124725
3 1.017202 1.019403

Clustering vector:
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [23] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [45] 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [67] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [89] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
[111] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[133] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Within cluster sum of squares by cluster:
[1] 21.03799 22.91031 21.11838
 (between_SS / total_SS =  98.0 %)

Available components:

[1] "cluster"      "centers"      "totss"       
[4] "withinss"     "tot.withinss" "betweenss"   
[7] "size"         "iter"         "ifault"      

Note that the cluster numbers are arbitrary labels and can change from run to run. In the output above, cluster 3 contains the points generated around (1, 1), cluster 1 those around (5, 5), and cluster 2 those around (9, 9).

Interpretation:

  • The algorithm has identified 3 clusters, each containing 50 points.

  • The cluster means (centroids) for the three clusters are approximately (1,1), (5,5), and (9,9), which matches the centers around which the data was generated.

  • The clustering vector indicates the cluster assignment for each data point.
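
The individual pieces of the result can also be pulled out by name rather than read off the printed summary. The sketch below is self-contained and makes two assumptions that are not in the example above: it passes nstart = 25 (an illustrative value) so that kmeans() tries several random starts and keeps the best one, and it uses a helper vector true_group to compare the clusters with the groups the data was generated from.

```r
# Regenerate the data exactly as in Step 1
set.seed(123)
x <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
y <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
data <- data.frame(x, y)

# nstart = 25 is an assumption for this sketch: run 25 random starts, keep the best
kmeans_result <- kmeans(data, centers = 3, nstart = 25)

kmeans_result$centers        # centroid coordinates, one row per cluster
kmeans_result$size           # number of points in each cluster
kmeans_result$tot.withinss   # total within-cluster sum of squares

# Cross-tabulate the generated group against the assigned cluster
true_group <- rep(1:3, each = 50)
table(true = true_group, cluster = kmeans_result$cluster)
```

If the clustering recovers the generated groups, each row of the table has a single non-zero entry, although the cluster numbers need not match the group numbers.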

Step 3: Visualize the Clusters

We can visualize the results of K-Means clustering by plotting the data points and coloring them based on their cluster assignment.

# Plot the data points and color them by cluster
plot(data$x, data$y, col = kmeans_result$cluster, pch = 19,
     main = "K-Means Clustering Results",
     xlab = "X", ylab = "Y")

# Add cluster centers to the plot
points(kmeans_result$centers[, 1], kmeans_result$centers[, 2], col = 1:3, pch = 8, cex = 2)

Interpretation:

  • The scatterplot shows the data points, with each point colored based on its cluster assignment.

  • The large asterisks (pch = 8) represent the cluster centroids, showing where the algorithm has placed the center of each cluster.

  • You can see that the K-Means algorithm has successfully grouped the points into three clusters around their respective centers.


Choosing the Optimal Number of Clusters \(k\)

One common method for selecting the optimal number of clusters is the Elbow Method, which involves plotting the total within-cluster sum of squares (WSS) for different values of \(k\) and looking for an “elbow” point where the WSS begins to decrease more slowly.

Step 4: Elbow Method to Determine \(k\)

We can compute the WSS for different values of \(k\) and plot the results to help determine the optimal number of clusters.

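# Compute the total within-cluster sum of squares (WSS) for k = 1 to 10.
# nstart = 10 runs each k from 10 random starts and keeps the best result,
# which makes the curve less sensitive to an unlucky initialization.
wss <- sapply(1:10, function(k){
  kmeans(data, centers = k, nstart = 10)$tot.withinss
})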

# Plot the Elbow Method
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method for Finding Optimal k")

Interpretation:

  • The Elbow Method plot shows the WSS for different values of \(k\).

  • The point where the decrease in WSS becomes more gradual is the “elbow” point, which suggests the optimal number of clusters.

  • In this case, the elbow typically occurs around \(k = 3\), confirming that 3 clusters is a good choice for this dataset.
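
The elbow can also be examined numerically by looking at how much the WSS falls each time \(k\) increases by one. The sketch below is one rough way to do this, not a formal selection rule; the names wss_drop and relative_drop and the choice nstart = 10 are assumptions made for illustration.

```r
# Regenerate the data exactly as in Step 1
set.seed(123)
x <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
y <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
data <- data.frame(x, y)

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(data, centers = k, nstart = 10)$tot.withinss
})

# Reduction in WSS gained by going from k to k + 1 clusters
wss_drop <- -diff(wss)

# The same reduction as a fraction of the WSS at k
relative_drop <- wss_drop / wss[-length(wss)]

data.frame(from_k = 1:9, to_k = 2:10,
           wss_drop = round(wss_drop, 1),
           relative_drop = round(relative_drop, 2))
```

For data like this, the relative drop should be large up to \(k = 3\) and much smaller afterwards, which is the numeric counterpart of the bend seen in the plot.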


K-Means Algorithm Process Summary:

  1. Choose \(k\), the number of clusters.

  2. Randomly initialize \(k\) centroids.

  3. Assign each point to the nearest centroid.

  4. Recalculate the centroids based on the mean of points in each cluster.

  5. Repeat steps 3-4 until convergence.


Conclusion:

K-Means clustering is a powerful and widely used unsupervised learning algorithm for identifying patterns and groups in data. In R, the kmeans() function makes it easy to apply this algorithm to your dataset, and visualizing the clusters can provide valuable insights into the structure of the data.

