The task of modeling multidimensional data sets encompasses a variety of statistical techniques, including ‘cluster analysis’. Cluster analysis is a statistical process for identifying homogeneous groups of data objects. It is a form of unsupervised machine learning and is central to data mining. Because databases today are so large, an implementation of any clustering algorithm must be scalable: it must complete the analysis within a practicable amount of time and operate on large volumes of data with many variables. Typical statistical clustering algorithms do not work well with large databases because of their memory requirements and execution times.
The advantage of the cluster analysis algorithm in Teradata Warehouse Miner is that it enables scalable data mining operations directly within the Teradata RDBMS. This is achieved by performing the data-intensive aspects of the algorithm using dynamically generated SQL, while low-intensity processing is performed in Teradata Warehouse Miner. A second key design feature is that model application, or scoring, is performed by generating and executing SQL based on information about the model saved in metadata result tables. A third key design feature is the use of the Expectation Maximization (EM) algorithm, a particularly sound statistical processing technique. Its simplicity makes a purely SQL-based implementation possible, which might not be feasible with other optimization techniques. Finally, the Gaussian mixture model gives a probabilistic approach to cluster assignment: each observation is assigned a probability of inclusion in each cluster. The clustering is based on a simplified form of generalized distance in which the variables are assumed to be independent, which is equivalent to using Euclidean distances on standardized measures.
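The EM procedure for a Gaussian mixture with independent variables can be illustrated with a small in-memory sketch. This is not the SQL that Teradata Warehouse Miner generates; it is a minimal Python illustration of the same idea, and every name in it (`fit_gmm`, `e_step`, `m_step`, `min_var`, the starting means) is hypothetical. Because the variables are treated as independent, each cluster keeps one variance per variable, so the exponent in the density is a squared Euclidean distance on standardized values.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of x under a Gaussian whose variables are independent."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def e_step(data, weights, means, variances):
    """Expectation: probability of each observation belonging to each cluster."""
    resp = []
    for x in data:
        logs = [
            math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        top = max(logs)  # subtract the maximum for numerical stability
        probs = [math.exp(value - top) for value in logs]
        total = sum(probs)
        resp.append([p / total for p in probs])
    return resp


def m_step(data, resp, k, min_var=1e-6):
    """Maximization: re-estimate weights, means, and per-variable variances."""
    n, d = len(data), len(data[0])
    weights, means, variances = [], [], []
    for j in range(k):
        # Floor the effective cluster size so an empty cluster cannot divide by zero.
        nj = max(sum(r[j] for r in resp), 1e-12)
        mean = [sum(r[j] * x[i] for r, x in zip(resp, data)) / nj for i in range(d)]
        var = [
            max(sum(r[j] * (x[i] - mean[i]) ** 2 for r, x in zip(resp, data)) / nj, min_var)
            for i in range(d)
        ]
        weights.append(nj / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances


def fit_gmm(data, init_means, iterations=50):
    """Run a fixed number of EM iterations from the given starting means."""
    k, d = len(init_means), len(data[0])
    weights = [1.0 / k] * k
    means = [list(m) for m in init_means]
    variances = [[1.0] * d for _ in range(k)]
    for _ in range(iterations):
        resp = e_step(data, weights, means, variances)
        weights, means, variances = m_step(data, resp, k)
    return weights, means, variances, e_step(data, weights, means, variances)


# Two well-separated groups of made-up observations.
data = [(0.1, -0.2), (-0.3, 0.2), (0.2, 0.1), (5.1, 4.8), (4.9, 5.2), (5.2, 5.0)]
weights, means, variances, resp = fit_gmm(data, init_means=[(0.0, 0.0), (5.0, 5.0)])
```

Each row of `resp` holds one observation's membership probabilities across the clusters and sums to 1, which is the probabilistic assignment described above. The sketch runs a fixed number of iterations for simplicity; a practical implementation would normally stop when the log likelihood stops improving.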
While this section primarily introduces Gaussian Mixture Model clustering, variations of this technique are described in the next section. In particular, the Fast K-Means clustering option uses a quite different technique: a stored procedure and a table operator that process the data more directly in the database for a considerable performance boost.