KModes (ML Engine) - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.10

1.1

Published

October 2019

Language

English (United States)

Last Update

2019-12-31

dita:mapPath

ima1540829771750.ditamap

dita:ditavalPath

jsj1481748799576.ditaval

dita:id

B700-4003

lifecycle

Product Category

Teradata Vantage™

KModes is an extension of KMeans that supports categorical data. KModes models are fit similarly to KMeans models. The core algorithm is an expectation-maximization algorithm that finds a locally optimal solution.

These are the main steps to fitting the model:

Initialization
Select a set of K initial cluster centers. You can create this set with the RandomSample (ML Engine) function, which samples rows from an input table using the kmeans++ and kmeans|| algorithms. These initialization algorithms create initial cluster centers that are more likely to lead to better local optima.
E step
A mapper assigns each point in the input table to one of the K clusters, and stores the sums of the numerical attributes and counts of the categorical attributes.
M step
A reducer aggregates the the statistics created for each worker in the E step and creates new cluster centers. For numerical attributes, the new center is the mean of the value of the attribute for the points assigned to the cluster. For categorical attributes, the new center is the mode of the attribute value for the points assigned to the cluster.

The algorithm runs for either a set number of iterations or until the change in movement of the cluster centers drops below a user-specified threshold.

When assigning points to a cluster, a hybrid distance function that combines a numeric distance and a categorical distance is required. The default distance between two data points in a KModes model is the squared Euclidean distance:

Formula for default distance between 2 data points in KModes model (squared Euclidean distance). Used by Machine Learning Engine function KModes.

where N denotes the indexes of numerical attributes, C denotes the indexes of categorical attributes, and wj denotes the weight to assign to a category.

The Manhattan distance can also be used:

Formula for default distance between 2 data points in a KModes model (Manhattan distance). Used by Machine Learning Engine function KModes.