K-Means is a popular clustering algorithm found in many data mining products and is one of the three principal variations of clustering available in the Teradata Warehouse Miner (TWM) product, along with the Gaussian Mixture Model and Poisson clustering. Volume 3 of the Teradata Warehouse Miner User Guide contains a description of these clustering algorithm variations. Fast K-Means clustering is offered as a fourth alternative for clustering and cluster scoring. It is available as an option in TWM clustering, and also as a stand-alone external stored procedure that can be executed directly in the Teradata database, independently of TWM. It is the stand-alone version and its parameters that are described in this document.
In order to execute the Fast K-Means algorithm or to score a model built by this algorithm the td_analyze stored procedure and the tda_kmeans table operator must be installed on the Teradata system, with appropriate permissions such as Execute Procedure. The Fast K-Means feature is dependent on Release 15.00 of the Teradata RDBMS.
The first parameter for clustering is the Kmeans function name, followed by clustering parameters.
Fast K-Means Clustering returns two data sets that may be viewed as result sets. The first result set is a progress report with two columns, a timestamp and a progress message, from which the user can see how the algorithm converged and what made it stop processing. The second result set contains cluster means and variances. Specifically, the rows associated with positive cluster IDs contain the average values of each of the clustered columns along with the count for each cluster ID. The rows associated with negative cluster IDs contain the variance of each of the clustered columns for each cluster ID.
Syntaxcall twm. td_analyze('Kmeans','database=twm_source;tablename=twm_tablename;columns=(columns);outclusterdatabase=twm;outclustertable=cust_analysis_clusters;kvalue=value;iterations=50;threshold=.001;operatordatabase=twm;continuation=true');
- The input columns used in clustering. The columns must reside in the table named with the tablename parameter, residing in the database named with the database parameter.
- For example: columns=column1,column2,column3
- The database containing the input table.
- The parameter that identifies the type of function being performed.
- The number of clusters to be contained in the cluster model.
- The database that will contain the resulting output table that represents a cluster model.
- The name of the output table representing the cluster model.
- The name of the table containing the data that is to be clustered.
- Indicates that clustering should (true) or should not (false) begin with values determined by pre-existing result tables rather than random values. The default value is false.
- The maximum number of iterations to perform during modeling. The default is 50.
- The database where the tda_kmeans table operator called by td_analyze resides. If not specified, the database software searches the standard search path for table operators, including the current user database.
- For example: operatordatabase=twm
- A decimal value used to determine if the algorithm has converged based on how much the cluster centroids change from one iteration to the next. The default is .001.
Examples in this section show how to use K-Means. These examples assume that the td_analyze function has been installed in a database named twm.
Here is an example of invoking the td_analyze stored procedure and the tda_kmeans table operator to perform clustering. The resulting model is stored in twm.cust_analysis_clusters.
In this example, continuation is requested from the results of the previous example.