TD_KMeans Usage Notes - Teradata Vantage

Teradata® VantageCloud Lake


K-means is an unsupervised clustering algorithm. It groups a set of n observations into k clusters based on their proximity to the cluster centers, with the objective of minimizing the within-cluster variance so that similar observations end up in the same cluster. Every one of the n points is assigned to exactly one cluster.
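
For cluster centers μ_1, …, μ_k and clusters C_1, …, C_k, this objective (the within-cluster sum of squares) is commonly written as:

    \min_{C_1, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where μ_i is the mean of the points assigned to cluster C_i.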

The k-means algorithm calculates the distance between a point and each cluster center and assigns the point to the nearest cluster. The algorithm assumes that data points that are close together are similar.
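
The following is a minimal sketch of that assignment step, written with NumPy for illustration only; it is not the TD_KMeans implementation or syntax, and the points and centers are hypothetical.

    import numpy as np

    points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]])
    centers = np.array([[1.0, 2.0], [8.5, 9.5]])   # two hypothetical cluster centers

    # Distance from every point to every center: shape (n_points, k)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

    # Each point goes to the cluster whose center is nearest
    assignments = distances.argmin(axis=1)
    print(assignments)   # [0 0 1 1]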

The number of clusters, k, is a crucial hyperparameter. If this value is not known in advance, you can estimate it with a method such as the Elbow or Silhouette method.

Applications such as Market Segmentation, Document Clustering, Image Segmentation, and Image Compression use the k-means algorithm. Although the algorithm is simple and achieves good performance, it is sensitive to outliers, which can distort the cluster centers. K-means can also become slow on larger datasets because each iteration computes the distance from every data point to every cluster center.

Why Use K-Means

K-means clustering is an unsupervised learning algorithm that separates an unlabeled dataset into different clusters. The value k determines the number of pre-defined clusters to create: k=2 results in two clusters, k=3 in three clusters, and so on.

This algorithm enables you to group data into different categories and identify these groups in an unlabeled dataset without any training. K-means is a centroid-based algorithm, where each cluster is associated with a centroid. The primary goal of the algorithm is to minimize the total distance between data points and the centroids of their respective clusters.

The input is an unlabeled dataset that the algorithm divides into k clusters, repeating the process until no better clustering is found. You must choose the value of k in advance.

The k-means algorithm performs two main tasks: iteratively determining the best positions for the k center points, or centroids, and assigning each data point to its nearest centroid to form a cluster. Each cluster contains data points that are similar to one another and distinct from the points in other clusters.
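
A compact sketch of these two tasks, extending the earlier assignment step into the full iterative loop, is shown below. It uses NumPy for illustration only and is not the TD_KMeans implementation; the function name kmeans and its defaults are hypothetical.

    import numpy as np

    def kmeans(points, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Start from k randomly chosen points as the initial centroids.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Task 1: assign each point to its nearest centroid.
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Task 2: move each centroid to the mean of the points assigned to it.
            new_centroids = np.array([
                points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
                for i in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break   # converged: the centroids stopped moving
            centroids = new_centroids
        return centroids, labels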

The following diagram illustrates how the k-means algorithm works.

K-means is a simple, fast, and versatile algorithm that has been applied to a wide range of problems. However, its performance can be sensitive to the following:

  • The choice of seed or the order of the data points.
  • Outliers.
  • The scale of different variables (see the scaling sketch after this list).
  • The size of the dataset, because of the computation cost.
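
Because k-means uses raw distances, a variable measured in large units can dominate the result. A common mitigation, sketched here with scikit-learn as an illustration (this is not a TD_KMeans feature, and the data is hypothetical), is to standardize the columns before clustering:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Hypothetical data: income in dollars (large scale) and age in years (small scale).
    X = np.array([[52000, 25], [61000, 47], [120000, 33], [45000, 52], [98000, 29]])

    X_scaled = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1 per column
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
    print(labels)

Without scaling, the income column alone would drive the distance calculation.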

Advantages of k-means:

  • Simple to implement.
  • Faster than many other clustering algorithms, such as hierarchical clustering.
  • Guarantees convergence.
  • Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

In conclusion, k-means is a simple yet powerful algorithm with many applications in machine learning and computer science, including Anomaly Detection, Image Segmentation, and Recommendation Engines. Although you must provide a value of k (the number of clusters), you can estimate it using methods such as the Elbow or Silhouette method.
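
As an illustration of the Elbow method (again sketched with scikit-learn rather than TD_KMeans, on synthetic data), you can fit k-means for several values of k and watch where the within-cluster sum of squares stops dropping sharply:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Synthetic data: three well-separated groups of points.
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

    for k in range(1, 7):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(model.inertia_, 1))   # inertia_ is the within-cluster sum of squares
    # The values drop steeply up to k=3 and then level off, suggesting k=3.

scikit-learn also provides sklearn.metrics.silhouette_score for the Silhouette method.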