The clustering algorithm requires specification of the desired number of clusters. After preprocessing, an initialization step determines seed values for the clusters, and clustering is then performed using the Expectation-Maximization (EM) algorithm, which applies conditional probability and maximum likelihood principles to converge on the cluster assignments that yield the maximum likelihood value.
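As an illustration of this flow (a sketch only, not the product's implementation; the function name, initialization scheme, and convergence tolerance are assumptions), the following Python/NumPy code seeds the clusters and then alternates E and M steps until the total log-likelihood stops improving:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(X, k, max_iter=100, tol=1e-6, seed=0):
    """Sketch of the EM clustering loop: seed the clusters, then alternate
    E and M steps until the total log-likelihood stops improving."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # seed means from random rows
    covs = np.stack([np.cov(X, rowvar=False)] * k)        # seed covariance matrices
    priors = np.full(k, 1.0 / k)                          # equal prior probabilities
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: prior-weighted Gaussian density of every row under every cluster.
        dens = np.column_stack([
            priors[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
            for j in range(k)
        ])
        ll = np.log(dens.sum(axis=1)).sum()               # total log-likelihood
        resp = dens / dens.sum(axis=1, keepdims=True)     # membership probabilities
        # M-step: maximum-likelihood update of means, covariances, and priors.
        for j in range(k):
            w = resp[:, j]
            priors[j] = w.mean()
            means[j] = w @ X / w.sum()
            diff = X - means[j]
            covs[j] = (w[:, None] * diff).T @ diff / w.sum()
        if abs(ll - prev_ll) < tol:                       # converged
            break
        prev_ll = ll
    return means, covs, priors, resp
```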
In a Gaussian Mixture (GM) model, the variables being modeled are assumed to follow a normal (Gaussian) probability distribution. For each cluster, a likelihood equation can be constructed giving the probability that a randomly selected observation from that cluster would look like a particular observation; a maximum likelihood rule for classification assigns the observation to the cluster with the highest likelihood value. In computing these probabilities, the relative sizes of the clusters serve as prior probabilities, and conditional probabilities combine them with the cluster likelihoods to yield a probability of membership of each row in each cluster. Rows are then reassigned to clusters with probabilistic weighting, after units of distance have been transformed to units of standard deviation of the standard normal distribution via the Gaussian distance function (a numeric sketch of this function follows the variable list below):

p = e^(-d/2) / ((2π)^(n/2) |R|^(1/2))

where:
- p is dimensioned 1 by 1 and is the probability of membership of a point in a cluster
- d is dimensioned 1 by 1 and is the Mahalanobis distance (computed in its squared form) of the point from the cluster mean
- n is dimensioned 1 by 1 and is the number of variables
- R is dimensioned n by n and is the cluster variance/covariance matrix
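As a concrete reading of the formula and definitions above, here is a small NumPy sketch (the function name is an assumption, not part of the product):

```python
import numpy as np

def gaussian_distance(x, mean, R):
    """Gaussian distance function: turns the (squared) Mahalanobis distance d
    of point x from a cluster with the given mean and covariance R into the
    membership density p = e^(-d/2) / ((2*pi)**(n/2) * |R|**0.5)."""
    n = len(x)                                   # number of variables
    diff = x - mean
    d = diff @ np.linalg.inv(R) @ diff           # Mahalanobis distance (squared form)
    return np.exp(-d / 2.0) / ((2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(R)))
```

For n = 1 with R = [[1]], this reduces to the standard normal density, which is the sense in which distances are expressed in units of standard deviation of the standard normal distribution.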
The Gaussian distance function thus translates distance into a probability of membership under this probabilistic model. Intermediate results are saved in Teradata tables after each iteration, so the algorithm may be stopped at any point and the latest results viewed, or a new clustering process resumed from that point. These results consist of the cluster means, variances, and prior probabilities.
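To illustrate why these three artifacts are sufficient checkpoints, here is a minimal sketch in which an .npz file stands in for the Teradata tables (the file format and function names are assumptions, not the product's storage layout):

```python
import numpy as np

def save_iteration(path, means, covs, priors):
    """Persist the results of one EM iteration: cluster means,
    variance/covariance matrices, and prior probabilities."""
    np.savez(path, means=means, covs=covs, priors=priors)

def load_iteration(path):
    """Reload a saved iteration; these three arrays fully determine the
    next E-step, so clustering can be viewed or resumed from this point."""
    state = np.load(path)
    return state["means"], state["covs"], state["priors"]
```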