K-Means Clustering is an unsupervised machine learning algorithm that
partitions a dataset into \(k\) distinct clusters by minimizing the
within-cluster variance, that is, the sum of squared distances between
each point and the centroid of its cluster. It is commonly used for
pattern recognition, data compression, and finding structure in data.
The key idea behind K-Means is to divide data points into clusters
such that data points in the same cluster are similar to each other,
while data points in different clusters are dissimilar.
The K-Means algorithm iteratively updates cluster centroids and
assigns data points to the nearest centroid based on the Euclidean
distance between the points and the centroids.
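The objective described above can be written out explicitly. As a sketch, using notation introduced here only for illustration (\(C_j\) for the set of points in cluster \(j\), \(\mu_j\) for its centroid, and \(x_i\) for a data point), K-Means looks for the assignment that minimizes the within-cluster sum of squares:

\[
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]

Each assignment step and each centroid update can only lower \(J\) or leave it unchanged, which is why the iteration settles down, although possibly at a local rather than a global minimum.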
Steps of the K-Means Algorithm:
1. Choose the number of clusters \(k\).
2. Initialize: Randomly select \(k\) initial centroids.
3. Assign each data point to the nearest centroid
based on distance (typically Euclidean distance).
4. Update centroids by calculating the mean of all
points assigned to each cluster.
5. Repeat steps 3 and 4 until the centroids do not
change significantly (convergence).
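The steps above can be sketched directly in base R. This is a minimal illustration only, not how R's built-in kmeans() is implemented; the function name simple_kmeans, its arguments, and the toy data are all hypothetical and introduced here for the example.

```r
# Minimal from-scratch sketch of the K-Means steps (illustrative only;
# simple_kmeans is a hypothetical helper, not part of any package).
simple_kmeans <- function(X, k, max_iter = 100, tol = 1e-8) {
  X <- as.matrix(X)
  # Step 2: pick k distinct rows of the data as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every point to every centroid
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(X, 2, centroids[j, ])^2)
    })
    # Index of the nearest centroid for each point (argmin over columns)
    cluster <- max.col(-d2, ties.method = "first")
    # Step 4: recompute each centroid as the mean of its assigned points
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- X[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 5: stop once no centroid coordinate moves by more than tol
    shift <- max(abs(new_centroids - centroids))
    centroids <- new_centroids
    if (shift < tol) break
  }
  list(cluster = cluster, centers = centroids, iterations = iter)
}

# Usage on toy data (an assumption for illustration): two 2-D blobs
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
fit$centers
```

If a cluster ever ends up empty, this sketch simply leaves its centroid where it was; real implementations handle that case more carefully.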
Advantages:
Simple and efficient: K-Means is easy to
implement and works well with large datasets.
Scalable: It can handle a large number of
features and data points efficiently.
Flexible: It applies to many kinds of numeric data and works
well whenever the groups are compact and reasonably well separated.
Disadvantages:
Sensitive to the initial choice of centroids:
Different initializations can lead to different clusters (local
minima).
Requires choosing \(k\): The number of clusters \(k\) must be specified in advance.
Sensitive to outliers: Because each centroid is a mean, a few extreme
points can pull it away from the bulk of its cluster and change the assignments.
Assumes spherical clusters: K-Means performs
best when clusters are circular or spherical, as it uses Euclidean
distance.
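The sensitivity to initialization can be seen by running K-Means several times from different random starts. The sketch below uses toy data invented for this illustration (the blobs object is not part of the tutorial's dataset) and the nstart argument of R's kmeans(), which repeats the algorithm from several random starts and keeps the fit with the lowest total within-cluster sum of squares.

```r
set.seed(2024)
# Toy data (an assumption for illustration): three well-separated 2-D blobs
blobs <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares from 20 independent single-start runs
single_wss <- sapply(1:20, function(i) {
  kmeans(blobs, centers = 3, nstart = 1)$tot.withinss
})

# One call that tries 25 random starts internally and keeps the best one
multi <- kmeans(blobs, centers = 3, nstart = 25)

range(single_wss)    # spread of results across single starts
multi$tot.withinss   # result of the best of 25 starts
```

A wide range in single_wss indicates that some starts ended in a poor local minimum; using a larger nstart is a common, inexpensive safeguard.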
Applications:
Market segmentation: Grouping customers based on
purchasing behavior or demographic information.
Image compression: Reducing the number of colors
in an image by grouping pixels into color clusters.
Anomaly detection: Identifying outliers in
financial transactions or network traffic.
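As a rough illustration of the image compression use case, the sketch below quantizes colors with kmeans(). It uses a synthetic matrix of random RGB values as a stand-in for real pixels, so the object names (pixels, palette_fit, compressed) and the palette size of 8 are assumptions made for this example.

```r
set.seed(7)
# Synthetic stand-in for an image: 1,000 pixels with RGB values in [0, 1]
pixels <- matrix(runif(1000 * 3), ncol = 3,
                 dimnames = list(NULL, c("r", "g", "b")))

# Cluster the pixel colors into a palette of 8 colors
palette_fit <- kmeans(pixels, centers = 8, nstart = 5, iter.max = 50)

# Replace every pixel by the centroid (palette color) of its cluster
compressed <- palette_fit$centers[palette_fit$cluster, ]

# Number of distinct colors left after compression
nrow(unique(compressed))
```

With a real image, the pixel matrix would come from an image-reading package, and the compressed matrix would be reshaped back to the image dimensions.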
Pros:
Computationally fast and scalable.
Works well with large datasets.
Easy to interpret and visualize in low dimensions.
Cons:
Requires the number of clusters \(k\) to be pre-specified.
Struggles with non-convex or irregular-shaped clusters.
Sensitive to the scale of data and outliers.
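The sensitivity to scale can be illustrated with features measured in very different units. The sketch below uses hypothetical customer data made up for this example (income in dollars, age in years) and compares clustering on the raw values with clustering after standardizing with scale().

```r
set.seed(42)
# Hypothetical data: the two groups differ clearly in age, barely in income
customers <- data.frame(
  income = c(rnorm(30, mean = 30000, sd = 3000), rnorm(30, mean = 32000, sd = 3000)),
  age    = c(rnorm(30, mean = 25, sd = 2),       rnorm(30, mean = 60, sd = 2))
)
true_group <- rep(c("young", "old"), each = 30)

# Raw data: income dominates the Euclidean distance because of its units
raw_fit <- kmeans(customers, centers = 2, nstart = 10)

# Standardized data: each column has mean 0 and standard deviation 1
scaled <- scale(customers)
scaled_fit <- kmeans(scaled, centers = 2, nstart = 10)

# Compare cluster labels with the known age groups
table(true_group, raw_fit$cluster)
table(true_group, scaled_fit$cluster)
```

Whether to standardize depends on the data; it is usually worth considering when features are on different scales.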
K-Means Clustering Example in R
We will use K-Means to cluster data points based on two variables.
We’ll generate a dataset of random points and use the K-Means algorithm
to identify clusters.
Step 1: Create the Data
We’ll generate a dataset with two features (x and y) that can be
grouped into clusters.
# Set the random seed so the results are reproducible
set.seed(123)
# Generate random data points
x <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
y <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
# Combine into a data frame
data <- data.frame(x, y)
head(data)
Here, we generate 150 data points distributed around three centers:
(1, 1), (5, 5), and (9, 9).
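Step 2: Apply K-Means Clustering
The call that produces the clustering output is not shown in the text, so the following is a minimal sketch of what it presumably looks like. It assumes \(k = 3\) and stores the result in kmeans_result, the name used by the plotting code later on; the nstart = 25 argument is an addition here to reduce the chance of a poor random start.

```r
# Recreate the data from Step 1 so this block runs on its own
set.seed(123)
x <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
y <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
data <- data.frame(x, y)

# Run K-Means with k = 3 (nstart = 25 is an added safeguard, see above)
kmeans_result <- kmeans(data, centers = 3, nstart = 25)

# Print the full summary, then inspect individual components
print(kmeans_result)
kmeans_result$size      # number of points in each cluster
kmeans_result$centers   # centroid coordinates
```

Cluster labels are arbitrary, so the numbering of the clusters may differ from one run to another even when the grouping itself is the same.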
Output (simplified):
K-means clustering with 3 clusters of sizes 50, 50, 50
Cluster means:
         x        y
1 1.052739 1.043346
2 9.034017 9.033384
3 4.963213 5.012003
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
...
Interpretation:
The algorithm has identified 3 clusters, each containing 50
points.
The cluster means (centroids) for the three
clusters are approximately (1,1), (5,5), and (9,9), which matches the
centers around which the data was generated.
The clustering vector indicates the cluster
assignment for each data point.
Step 3: Visualize the Clusters
We can visualize the results of K-Means clustering by plotting the
data points and coloring them based on their cluster assignment.
# Plot the data points and color them by cluster
plot(data$x, data$y, col = kmeans_result$cluster, pch = 19,
     main = "K-Means Clustering Results",
     xlab = "X", ylab = "Y")
# Add cluster centers to the plot
points(kmeans_result$centers[, 1], kmeans_result$centers[, 2], col = 1:3, pch = 8, cex = 2)

Interpretation:
The scatterplot shows the data points, with each point colored
based on its cluster assignment.
The large asterisks (pch = 8) represent the cluster centroids, showing where
the algorithm has placed the center of each cluster.
You can see that the K-Means algorithm has successfully grouped
the points into three clusters around their respective centers.
Choosing the Optimal Number of Clusters \(k\)
One common method for selecting the optimal number of clusters is the
Elbow Method, which involves plotting the total
within-cluster sum of squares (WSS) for different values of \(k\) and looking for an “elbow” point where
the WSS begins to decrease more slowly.
Step 4: Elbow Method to Determine \(k\)
We can compute the WSS for different values of \(k\) and plot the results to help determine
the optimal number of clusters.
# Compute within-cluster sum of squares (WSS) for different k values
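wss <- sapply(1:10, function(k) {
  # nstart = 25 reruns K-Means from several random starts and keeps the best fit,
  # so a single unlucky initialization does not distort the curve
  kmeans(data, centers = k, nstart = 25)$tot.withinss
})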
# Plot the Elbow Method
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method for Finding Optimal k")

Interpretation:
The Elbow Method plot shows the WSS for
different values of \(k\).
The point where the decrease in WSS becomes more gradual is the
“elbow” point, which suggests the optimal number of clusters.
In this case, the elbow typically occurs around \(k = 3\), confirming that 3 clusters is a
good choice for this dataset.
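Reading the elbow off a plot is somewhat subjective. One rough way to support it numerically is to look at how much of the WSS each additional cluster removes. The sketch below recreates the tutorial's data so that it runs on its own; the pct_drop name and the nstart and iter.max settings are assumptions made for this example.

```r
# Recreate the data from Step 1 so this block runs on its own
set.seed(123)
x <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
y <- c(rnorm(50, mean = 1, sd = 0.5), rnorm(50, mean = 5, sd = 0.5), rnorm(50, mean = 9, sd = 0.5))
data <- data.frame(x, y)

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(data, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# Percentage of the remaining WSS removed when going from k - 1 to k clusters
pct_drop <- -diff(wss) / head(wss, -1) * 100
names(pct_drop) <- paste0("k=", 2:10)
round(pct_drop, 1)
```

Large drops followed by consistently small ones point to the elbow; this is a heuristic and should be read alongside the plot, not in place of it.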
K-Means Algorithm Process Summary:
1. Choose \(k\), the number of clusters.
2. Randomly initialize \(k\) centroids.
3. Assign each point to the nearest centroid.
4. Recalculate the centroids based on the mean of points in each
cluster.
5. Repeat steps 3-4 until convergence.
Conclusion:
K-Means clustering is a powerful and widely used unsupervised
learning algorithm for identifying patterns and groups in data. In R,
the kmeans() function makes it easy to apply this algorithm to your
dataset, and visualizing the clusters can provide valuable insights
into the structure of the data.
