Fast K-Means Clustering

Vantage Analytics Library User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, Lake, VMware
Product: Vantage Analytics Library
Release Number: 2.2.0
Published: March 2023
Language: English (United States)
Last Update: 2024-01-02
Product Category: Teradata Vantage

Modeling multidimensional datasets uses many statistical techniques, including cluster analysis, a statistical process for identifying homogeneous groups of data objects. Cluster analysis is based on unsupervised machine learning and is crucial in data mining. Typical clustering algorithms work poorly with large databases due to memory limitations and long execution times.

K-Means clustering assigns each row of data to a cluster centroid. The algorithm assumes each data point belongs to only one cluster, and the determination is considered a hard assignment. After the algorithm computes the distance from a given data point to each cluster centroid, the algorithm assigns the data point to the cluster whose center is nearest to the data point. On the next iteration, the algorithm uses the point value to redefine the mean and variance of its assigned cluster.
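The assign-then-update cycle described above can be sketched in a few lines of plain Python. This is only an illustration of the general K-Means step, not the Vantage Analytics Library implementation: the function names `nearest_centroid` and `kmeans_iteration` and the sample points are hypothetical, and the sketch updates only the cluster means, not the variances.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (hard assignment)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans_iteration(points, centroids):
    """Run one assign-then-update pass; return (labels, new_centroids)."""
    # Assignment step: each point goes to exactly one cluster.
    labels = [nearest_centroid(p, centroids) for p in points]

    # Update step: each centroid becomes the mean of its assigned points.
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: an empty cluster simply keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members)
                  for d in range(len(old)))
        )
    return labels, new_centroids

# Hypothetical two-dimensional data and starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels, centroids = kmeans_iteration(points, centroids)
```

Repeating `kmeans_iteration` until the labels stop changing gives a basic form of the algorithm; production implementations typically add a convergence threshold and an iteration limit.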

K-Means clustering algorithms compute the distance from a data point to a cluster centroid by summing the squared differences across variables, without scaling by each variable's variance. (That is, they compute unnormalized Euclidean distances.) Therefore, variables with large variances have more influence over cluster definition than those with small variances.
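A small example may make the effect of unnormalized distances concrete. The rows below are made up for illustration (age in years, income in dollars), and `euclidean` is a hypothetical helper, not a library function.

```python
import math

def euclidean(a, b):
    """Unnormalized Euclidean distance: no per-variable scaling."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rows: (age in years, income in dollars).
row = (30.0, 52000.0)
similar_age = (31.0, 60000.0)      # 1 year apart, 8,000 dollars apart
similar_income = (65.0, 52500.0)   # 35 years apart, 500 dollars apart

d_age = euclidean(row, similar_age)
d_income = euclidean(row, similar_income)

# Income spans a far larger numeric range than age, so it dominates:
# the row 35 years older counts as the nearer neighbor.
closer = "similar_income" if d_income < d_age else "similar_age"
```

Here the income difference swamps the age difference, which is why normalizing the variables first is usually worthwhile.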

Fast K-Means Clustering outputs a clustering model that Fast K-Means Cluster Scoring can use to score new data.

Data Preprocessing

To prepare data for K-Means clustering, Teradata recommends doing the following:
  • Normalize numeric variables to give them equal weight.

    Use the Z-Score transformation or the Rescale transformation with lower bound 0 and upper bound 1.

    Teradata errors can occur for nonnormalized numeric values with more than 15 significant digits. If you do not normalize the numeric variables, try to prevent overflow and underflow conditions by multiplying small numbers by a constant value and dividing large numbers by a constant value. This changes the unit of measure but does not affect the clusters.

  • Convert categorical variables to numeric variables, using the Design Code transformation.
  • Replace null values so they do not bias or invalidate the analysis, using the Null Replacement transformation.
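The recommendations above can be illustrated with a plain-Python sketch of each kind of transformation. These helpers are hypothetical stand-ins, not the Vantage Analytics Library transformations themselves. In particular, the sketch assumes that Z-Score uses the population standard deviation, that Design Code behaves like simple dummy (one-hot) coding, and that the null replacement value is chosen by hand.

```python
import statistics

def replace_nulls(values, fill):
    """Replace None with a caller-chosen fill value."""
    return [fill if v is None else v for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation.

    Assumes the values are not all identical (standard deviation > 0).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def rescale(values, lower=0.0, upper=1.0):
    """Map values linearly onto [lower, upper]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]

def dummy_code(values):
    """Turn a categorical column into 0/1 indicator columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical columns.
ages = [20.0, None, 40.0, 60.0]
filled = replace_nulls(ages, 40.0)   # fill value picked for illustration
scaled = rescale(filled)             # bounds 0 and 1, as recommended above
standardized = z_score(filled)
colors = dummy_code(["red", "blue", "red", "green"])  # columns: blue, green, red
```

Nulls are replaced first so that the mean, minimum, and maximum used by the scaling steps are computed on complete columns.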