Fast K-Means Clustering

Vantage Analytics Library User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, Lake, VMware
Product: Vantage Analytics Library
Release Number: 2.2.0
Published: March 2023
Language: English (United States)
Last Update: 2024-01-02

Product Category: Teradata Vantage

Modeling multidimensional datasets uses many statistical techniques, including cluster analysis, a statistical process for identifying homogeneous groups of data objects. Cluster analysis is based on unsupervised machine learning and is crucial in data mining. Typical clustering algorithms work poorly with large databases due to memory limitations and long execution times.

K-Means clustering assigns each row of data to a cluster centroid. The algorithm assumes that each data point belongs to exactly one cluster, so each determination is a hard assignment. The algorithm computes the distance from a given data point to each cluster centroid and then assigns the data point to the cluster whose centroid is nearest. On the next iteration, the algorithm uses the data point's values to recompute the mean and variance of its assigned cluster.
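The assign-then-update cycle described above can be sketched in a few lines of plain Python. This is an illustrative sketch of one generic K-Means iteration, not the Vantage Analytics Library implementation: the function names, the sample points, and the choice to leave an empty cluster's centroid unchanged are all assumptions, and only the cluster mean is recomputed here (not the variance).

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (hard assignment)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans_iteration(points, centroids):
    """Run one assign-then-update pass and return (labels, new_centroids)."""
    labels = [nearest_centroid(p, centroids) for p in points]
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(tuple(old))
            continue
        dims = len(members[0])
        # The new centroid is the per-dimension mean of the assigned points.
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
        )
    return labels, new_centroids

# Hypothetical two-dimensional data and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels, centroids = kmeans_iteration(points, centroids)
```

In this sketch, repeating `kmeans_iteration` until `labels` stops changing would give a basic convergence loop.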

K-Means clustering algorithms compute the distance from a data point to a cluster centroid by summing the squared differences across variables, without scaling each variable by its variance. (That is, they compute unnormalized Euclidean distances.) Therefore, variables with large variances have more influence over cluster definition than those with small variances.
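A small numeric sketch may make the effect of unnormalized distances concrete. The rows, the column meanings (income and age), and the helper name `zscore_columns` are hypothetical, chosen only to contrast a large-variance variable with a small-variance one. The Z-score step uses the population standard deviation, which is an assumption about how the scaling is done.

```python
import math
import statistics

# Hypothetical rows: (income in dollars, age in years).
rows = [(30000.0, 25.0), (32000.0, 60.0), (90000.0, 27.0)]

def zscore_columns(data):
    """Rescale each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*data))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in data
    ]

# Unnormalized Euclidean distances: the income column dominates.
raw_01 = math.dist(rows[0], rows[1])  # rows differ mainly in age
raw_02 = math.dist(rows[0], rows[2])  # rows differ mainly in income

# After Z-scoring, both variables contribute on a comparable scale.
scaled = zscore_columns(rows)
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])
```

On the raw values, the 35-year age gap between the first two rows barely registers next to the income differences. After scaling, the two distances are of similar size.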

Fast K-Means Clustering outputs a clustering model that Fast K-Means Cluster Scoring can use to score new data.

Data Preprocessing

To prepare data for K-Means clustering, Teradata recommends doing the following:

Normalize numeric variables to give them equal weight. Use the Z-Score transformation or the Rescale transformation with lower bound 0 and upper bound 1. Teradata errors can occur for nonnormalized numeric values with more than 15 significant digits. If you do not normalize the numeric variables, try to prevent overflow and underflow conditions by multiplying small numbers by a constant value and dividing large numbers by a constant value. This changes the unit of measure but does not affect the clusters.
Convert categorical variables to numeric variables, using the Design Code transformation.
Replace null values so they do not bias or invalidate the analysis, using the Null Replacement transformation.
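
The recommendations above can be illustrated with plain Python stand-ins. These helpers are hypothetical and are not the Vantage Analytics Library transformations: `replace_nulls` assumes mean imputation as the replacement policy, `rescale` assumes a linear min-max mapping, and `dummy_code` assumes that design coding amounts to one 0/1 column per category.

```python
import statistics

def replace_nulls(values):
    """Replace None with the mean of the non-null values (one possible policy)."""
    present = [v for v in values if v is not None]
    fill = statistics.fmean(present)
    return [fill if v is None else v for v in values]

def rescale(values, lower=0.0, upper=1.0):
    """Linearly map values onto the range [lower, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return the lower bound to avoid dividing by zero.
        return [lower for _ in values]
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]

def dummy_code(values):
    """Turn a categorical column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical input columns.
ages = [20.0, None, 40.0, 60.0]
regions = ["east", "west", "east", "north"]

ages_filled = replace_nulls(ages)      # nulls replaced before scaling
ages_scaled = rescale(ages_filled)     # numeric column mapped onto [0, 1]
region_columns = dummy_code(regions)   # categorical column made numeric
```

Null replacement runs before rescaling in this sketch because `min` and `max` cannot compare `None` with numbers. Whether the library requires the same ordering is not stated in the text above.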