K-Means Option - Teradata Warehouse Miner

Teradata Warehouse Miner User Guide - Volume 3 Analytic Functions

Product

Teradata Warehouse Miner

Release Number

5.4.5

Published

February 2018

Language

English (United States)

Last Update

2018-05-04

dita:id

B035-2302

Product Category

Software

The K-Means option reassigns each row to the cluster whose centroid is closest to it. Each data point is assumed to belong to exactly one cluster, so the determination is considered a ‘hard assignment’. After the distances from a given point to each cluster centroid are computed, the point is assigned to the cluster whose center is nearest. On the next iteration, the point’s value is used to redefine that cluster’s mean and variance. This contrasts with the default Gaussian option, in which rows are reassigned to clusters with probabilistic weighting after units of distance have been transformed to units of standard deviation via the Gaussian distance function.
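
The hard assignment and centroid update described above can be sketched in a few lines of Python. This is an illustration only, not the Teradata Warehouse Miner implementation: the function names (`euclidean`, `assign_hard`, `update_centroids`), the sample points, and the handling of empty clusters are all assumptions made for the example, and only the cluster mean is recomputed (the product also redefines the variance).

```python
import math

def euclidean(p, q):
    """Unnormalized Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_hard(points, centroids):
    """Hard assignment: each point gets the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its centroid.
            updated.append(old)
            continue
        updated.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return updated

# Hypothetical sample data: two obvious groups and two starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign_hard(points, centroids)                  # one cluster per point
centroids = update_centroids(points, labels, centroids)  # used on the next iteration
```

Each point receives exactly one label, which is the difference from a probabilistic scheme, where every point would carry a weight for every cluster.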

Also, with the K-Means option, the variables' distances to cluster centroids are calculated by summing, without any consideration of the variances, which effectively results in the use of unnormalized Euclidean distances. This implies that variables with large variances have a greater influence over the cluster definition than those with small variances. K-Means analyses of data that are not standardized typically produce results that: (a) are dominated by variables with large variances, and (b) virtually or totally ignore variables with small variances during cluster formation. Therefore, a typical preparatory step before conducting a K-Means cluster analysis is to standardize all of the numeric data to be clustered using the Z-score transformation function in Teradata Warehouse Miner. Alternatively, the Rescale function can be used to normalize all numeric data, with a lower boundary of zero and an upper boundary of one. Normalizing the data prior to clustering gives all the variables equal weight.
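
To illustrate why standardization matters, the following Python sketch compares each variable's contribution to a squared Euclidean distance before and after scaling. It is a sketch under stated assumptions, not the product's Z-score or Rescale function: the helper names (`z_score`, `rescale`, `squared_terms`), the sample `income` and `age` columns, and the use of the population standard deviation are all chosen for the example.

```python
import statistics

def z_score(column):
    """Standardize a column to mean 0 and standard deviation 1.

    Assumption: population standard deviation; a constant column
    (standard deviation 0) is not handled in this sketch.
    """
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]

def rescale(column, lower=0.0, upper=1.0):
    """Linearly map a column onto the range [lower, upper]."""
    lo, hi = min(column), max(column)
    return [lower + (x - lo) * (upper - lower) / (hi - lo) for x in column]

def squared_terms(columns, i, j):
    """Per-variable contributions to the squared distance between rows i and j."""
    return [(col[i] - col[j]) ** 2 for col in columns]

# Hypothetical data: income has a far larger variance than age.
income = [20000.0, 40000.0, 60000.0, 80000.0]
age = [25.0, 35.0, 45.0, 55.0]

raw = squared_terms([income, age], 0, 3)
standardized = squared_terms([z_score(income), z_score(age)], 0, 3)
rescaled = squared_terms([rescale(income), rescale(age)], 0, 3)

print("raw:", raw)                    # the income term dwarfs the age term
print("standardized:", standardized)  # both terms are on the same scale
print("rescaled:", rescaled)          # both terms are on the same scale
```

In the raw data, the distance between the first and last rows is determined almost entirely by income. After either transformation, the two variables contribute comparably to the distance.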