Summary - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product

Aster Analytics

Release Number

6.21

Published

November 2016

Language

English (United States)

Last Update

2018-04-14

dita:id

B700-1021

Product Category

Software

KModes is an extension of KMeans that supports categorical data. KModes models are fit similarly to KMeans models. The core algorithm is an expectation-maximization algorithm that finds a locally optimal solution. The main steps in fitting the model are:

Initialization - A set of K initial cluster centers is selected. This set can be generated with the RandomSample function, which samples rows from an input table using the kmeans++ or kmeans|| algorithm. These initialization algorithms tend to produce initial cluster centers that lead to better local optima.
E step - Performed by a mapper. Each point in the input table is assigned to one of the K clusters, and the per-cluster sums of the numerical attributes and counts of the categorical attribute values are stored.
M step - Performed by a reducer. The statistics generated by each worker in the E step are aggregated and new cluster centers are generated. For numerical attributes, the new center is the mean of the attribute values of the points assigned to the cluster. For categorical attributes, the new center is the mode of the attribute values of the points assigned to the cluster.
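
The E and M steps above can be sketched in a few lines of single-process Python. This is an illustrative sketch, not the Aster implementation: the function names (hybrid_distance, e_step, m_step), the unit weight on categorical mismatches, and the handling of empty clusters are all assumptions made for the example.

```python
from collections import Counter

def hybrid_distance(x, y, numeric_idx, categorical_idx):
    # Squared difference on numeric attributes plus a unit-weight mismatch
    # count on categorical attributes (the unit weight is an assumption).
    numeric = sum((x[j] - y[j]) ** 2 for j in numeric_idx)
    categorical = sum(1 for j in categorical_idx if x[j] != y[j])
    return numeric + categorical

def e_step(points, centers, numeric_idx, categorical_idx):
    # One accumulator per cluster: point count, numeric sums, category counts.
    stats = [{"n": 0,
              "sums": {j: 0.0 for j in numeric_idx},
              "counts": {j: Counter() for j in categorical_idx}}
             for _ in centers]
    for p in points:
        # Assign the point to the nearest center.
        k = min(range(len(centers)),
                key=lambda i: hybrid_distance(p, centers[i],
                                              numeric_idx, categorical_idx))
        s = stats[k]
        s["n"] += 1
        for j in numeric_idx:
            s["sums"][j] += p[j]
        for j in categorical_idx:
            s["counts"][j][p[j]] += 1
    return stats

def m_step(stats, old_centers, numeric_idx, categorical_idx):
    new_centers = []
    for s, old in zip(stats, old_centers):
        center = list(old)
        if s["n"] > 0:  # an empty cluster keeps its old center (assumption)
            for j in numeric_idx:
                center[j] = s["sums"][j] / s["n"]                 # mean
            for j in categorical_idx:
                center[j] = s["counts"][j].most_common(1)[0][0]   # mode
        new_centers.append(center)
    return new_centers
```

In a distributed setting, each mapper would produce its own stats list and the reducer would add the counts and sums together before computing the means and modes; the sketch collapses that into one process.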

The algorithm runs either for a set number of iterations or until the movement of the cluster centers between iterations drops below a user-specified threshold.
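
The stopping rule can be sketched as a small driver loop. The names here (run_until_converged, step, movement) are hypothetical, and how center movement is measured is left to a caller-supplied function, because the text does not specify it.

```python
def run_until_converged(centers, step, movement, max_iterations=10, threshold=0.01):
    """Repeat step() until the centers move less than threshold
    or the iteration cap is reached.

    step(centers) returns the updated centers (one E step plus one M step).
    movement(old, new) returns a non-negative number; its definition is an
    assumption supplied by the caller.
    Returns (centers, iterations_run, converged).
    """
    for iteration in range(1, max_iterations + 1):
        new_centers = step(centers)
        moved = movement(centers, new_centers)
        centers = new_centers
        if moved < threshold:
            return centers, iteration, True
    return centers, max_iterations, False
```

For example, with a toy step that halves a single numeric center and movement defined as the largest absolute change, the loop stops as soon as that change falls below the threshold, or reports non-convergence once the iteration cap is reached.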

When assigning points to a cluster, a hybrid distance function that combines a numeric distance and a categorical distance is required. The default distance between two data points in a KModes model is the squared Euclidean distance:
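
A form consistent with the definitions that follow is sketched here. It assumes that the categorical term is a weighted 0/1 mismatch indicator; the symbols x, y (the two data points), d, and δ are notation introduced for this sketch, with δ(a, b) = 0 if a = b and 1 otherwise.

$$
d(x, y) = \sum_{j \in N} (x_j - y_j)^2 + \sum_{j \in C} w_j \, \delta(x_j, y_j)
$$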

where N denotes the indices of numerical attributes, C denotes the indices of categorical attributes, and w_j denotes the weight assigned to categorical attribute j.
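
To make the roles of N, C, and the weights concrete, here is a minimal Python sketch of such a hybrid distance. It assumes that the categorical part contributes a weight for each attribute whose values differ (a 0/1 mismatch); the function name and the dictionary of weights are hypothetical.

```python
def squared_euclidean_hybrid(x, y, numeric_idx, categorical_idx, weights):
    # Numeric part: squared differences over the numeric attribute indices (N).
    numeric = sum((x[j] - y[j]) ** 2 for j in numeric_idx)
    # Categorical part: add weights[j] for each categorical index (C) whose
    # values differ. The 0/1 mismatch rule is an assumption of this sketch.
    categorical = sum(weights[j] for j in categorical_idx if x[j] != y[j])
    return numeric + categorical

x = (1.0, 4.0, "red", "small")
y = (3.0, 4.0, "blue", "small")
d = squared_euclidean_hybrid(x, y, numeric_idx=[0, 1], categorical_idx=[2, 3],
                             weights={2: 0.5, 3: 2.0})
print(d)  # 4.5: (1 - 3)^2 from attribute 0, plus weight 0.5 for the color mismatch
```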

The Manhattan distance can also be used:
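
A corresponding form is sketched here. It assumes that only the numeric term changes, from squared differences to absolute differences, while the categorical term remains a weighted 0/1 mismatch; x, y, d, and δ are notation introduced for this sketch, with δ(a, b) = 0 if a = b and 1 otherwise.

$$
d(x, y) = \sum_{j \in N} \lvert x_j - y_j \rvert + \sum_{j \in C} w_j \, \delta(x_j, y_j)
$$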