Fast K-Means Clustering - Teradata Warehouse Miner

Purpose

K-Means is one of the three variations of clustering available in TWM, along with the Gaussian Mixture Model and Poisson clustering. For a description of these clustering algorithm variations, see Teradata Warehouse Miner User Guide: Volume 3, Analytic Functions (B035-2302). Fast K-Means clustering is a fourth alternative for clustering and cluster scoring. It is an option in TWM clustering, and also a stand-alone external stored procedure that can be run directly in the Teradata database, independently of TWM. The stand-alone version and its parameters are described in this document.

To run the Fast K-Means algorithm or to score a model built by this algorithm, the td_analyze stored procedure and the tda_kmeans table operator must be installed on the Teradata system, with appropriate permissions such as Execute Procedure. The Fast K-Means feature is dependent on Release 15.00 of the Teradata RDBMS.

The first parameter for clustering is the Kmeans function name, followed by clustering parameters.

Fast K-Means Clustering returns two data sets that are viewed as result sets.

The first result set is a progress report with two columns, a timestamp and a progress message. Use the result to see how the algorithm converged and what made it stop processing.
The second result set contains cluster means and variances. Specifically, the rows associated with positive cluster IDs contain the average values of each of the clustered columns along with the count for each cluster ID. The rows associated with negative cluster IDs contain the variance of each of the clustered columns for each cluster ID.

Syntax

call twm. td_analyze('Kmeans','database=twm_source;tablename=twm_tablename;columns=(columns);outclusterdatabase=twm;outclustertable=cust_analysis_clusters;kvalue=value;iterations=50;threshold=.001;operatordatabase=twm;continuation=true');

Required Parameters

columns: The input columns used in clustering. The columns must reside in the table named with the tablename parameter, residing in the database named with the database parameter.; For example: columns=column1,column2,column3
database: The database containing the input table.
Kmeans: Identifies the type of function being performed.
kvalue: The number of clusters to be contained in the cluster model.
outclusterdatabase: The database to contain the resulting output table that represents a cluster model.
outclustertable: The name of the output table representing the cluster model.
tablename: The name of the table containing the data to cluster.

Optional Parameters

continuation: If true, clustering begins with values determined by pre-existing result tables rather than random values. The default is false.
iterations: The maximum number of iterations to perform during modeling. The default is 50.
operatordatabase: The database where the table operators called by td_analyze reside. If not specified, the database software searches the standard search path for table operators, including the current user database.; For example: operatordatabase=twm
overwrite: When overwrite is set to true (default), the output tables are dropped before creating new ones.
threshold: A decimal value used to determine if the algorithm has converged based on how much the cluster centroids change from one iteration to the next. The default is .001.

Examples

These examples show how to use K-Means. To run the provided examples, the td_analyze function must be installed in a database called twm and the TWM tutorial data must be installed in the twm_source database.

This example invokes the td_analyze stored procedure and the tda_kmeans table operator to perform clustering. The resulting model is stored in twm.cust_analysis_clusters.

call twm.td_analyze('Kmeans','database=twm_source;tablename=twm_customer_analysis;columns=avg_cc_bal,avg_ck_bal,avg_sv_bal;outclusterdatabase=twm;outclustertable=cust_analysis_clusters;kvalue=3;iterations=10;threshold=0.1;operatordatabase=twm;');

This example is a continuation requested from the results of the previous example.

call twm.td_analyze('Kmeans','database=twm_source;tablename=twm_customer_analysis;columns=avg_cc_bal,avg_ck_bal,avg_sv_bal;outclusterdatabase=twm;outclustertable=cust_analysis_clusters;kvalue=3;iterations=10;threshold=0.1;operatordatabase=twm;continuation=true');