Fast K-Means Cluster Scoring

In-Database Analytic Functions User Guide

brand: Software
prodname: Teradata Warehouse Miner
vrm_release: 5.4.2
category: User Guide
featnum: B035-2306-106K

Purpose

After a model has been built with the Fast K-Means Clustering algorithm, new data can be scored with Fast K-Means Cluster Scoring. The first argument to td_analyze is the function name KmeansScore; the second is a string of semicolon-separated cluster scoring parameters.

Fast K-Means Cluster Scoring returns one or two result sets. The first is a progress report with two columns: a timestamp and a progress message. The second is returned only if the samplescoresize parameter is set; it contains a sample of rows from the output score table, with the number of rows determined by the value of samplescoresize.

Syntax

call twm.td_analyze('KmeansScore','database=twm_source;tablename=twm_customer_analysis;columns=col1,col2,col3;outscoredatabase=twm;outscoretable=table;keycolumns=col;inclusterdatabase=database;inclustertable=table;kvalue=number;Optional Parameters;');
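
The parameter string in the syntax above is a sequence of name=value pairs, each terminated by a semicolon, with multi-valued parameters written as comma-separated lists. As a minimal sketch of that layout, the following Python helper assembles the CALL statement text from a dictionary. The helper name build_td_analyze_call and its install_db argument are hypothetical and not part of Teradata Warehouse Miner; the code only builds a string and does not connect to a database.

```python
def build_td_analyze_call(function_name, params, install_db="twm"):
    """Return the text of a td_analyze CALL statement (hypothetical helper).

    params maps parameter names to values. Lists and tuples are joined
    with commas, and booleans are written as true/false.
    """
    parts = []
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        value = str(value)
        # This sketch does no escaping, so reject characters that would
        # break the 'name=value;' layout or the surrounding SQL literal.
        if ";" in value or "'" in value:
            raise ValueError(f"unsupported character in value for {name}")
        parts.append(f"{name}={value};")
    param_string = "".join(parts)
    return f"call {install_db}.td_analyze('{function_name}','{param_string}');"


required_only = {
    "database": "twm_source",
    "tablename": "twm_customer_analysis",
    "columns": ["avg_cc_bal", "avg_ck_bal", "avg_sv_bal"],
    "outscoredatabase": "twm",
    "outscoretable": "cust_analysis_data",
    "keycolumns": ["cust_id"],
    "inclusterdatabase": "twm",
    "inclustertable": "cust_analysis_clusters",
    "kvalue": 3,
}
print(build_td_analyze_call("KmeansScore", required_only))
```

Because Python dictionaries preserve insertion order, the parameters appear in the statement in the order they were added.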

Required Parameters

columns
The input columns used in clustering. The columns must reside in the table named by the tablename parameter, in the database named by the database parameter.
For example: columns=column1,column2,column3
database
The database containing the input table.
inclusterdatabase
The database that contains the table that represents the cluster model to be scored.
inclustertable
The name of the input table that contains the cluster model to be scored.
keycolumns
The names of one or more columns in the input table to be used as the primary index of the scored output table.
kvalue
The number of clusters to be contained in the cluster model.
outscoredatabase
The database that will contain the resulting scored output table.
outscoretable
The name of the scored output table to be built.
tablename
The name of the table containing the data that is to be scored.
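
Since all nine parameters listed above are required, it can be useful to check a parameter string before submitting it. The following is a rough sketch of such a check in Python. The function name missing_required is hypothetical, and the parsing assumes the plain name=value; layout shown in the syntax, with no quoting or escaping.

```python
# The required parameter names, as listed in this section.
REQUIRED_PARAMETERS = (
    "columns", "database", "inclusterdatabase", "inclustertable",
    "keycolumns", "kvalue", "outscoredatabase", "outscoretable", "tablename",
)


def missing_required(param_string):
    """Return required parameter names absent from a 'name=value;' string."""
    supplied = set()
    for item in param_string.split(";"):
        name, sep, _value = item.partition("=")
        if sep:  # skip empty fragments, such as the one after the last ';'
            supplied.add(name.strip())
    return [name for name in REQUIRED_PARAMETERS if name not in supplied]


incomplete = (
    "database=twm_source;tablename=twm_customer_analysis;"
    "columns=avg_cc_bal,avg_ck_bal,avg_sv_bal;outscoredatabase=twm;"
    "outscoretable=cust_analysis_data;keycolumns=cust_id;"
)
print(missing_required(incomplete))
# ['inclusterdatabase', 'inclustertable', 'kvalue']
```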

Optional Parameters

clustername
The name of the column representing the cluster identifier. The default value is clusterid.
fallback
An optional flag. When set to true, the scored output table is created with the fallback attribute (that is, with a mirrored copy).
operatordatabase
The database where the tda_kmeans table operator called by td_analyze resides. If not specified, the database software searches the standard search path for table operators, including the current user database.
For example: operatordatabase=twm
retaincolumns
A comma-separated list of columns to be copied to the scored output table with the same names and values they have in the input table to be scored.
samplescoresize
The number of rows of the output score table to be returned as a result set. If this parameter is not set, no sample result set is returned.
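
The optional parameters interact with what a scoring run produces: clustername falls back to clusterid when omitted, and the sample result set appears only when samplescoresize is set. The short Python sketch below summarizes those two rules for a given set of parameters. The function summarize_scoring_options is hypothetical and exists only to illustrate the descriptions in this guide; it does not query the database.

```python
def summarize_scoring_options(params):
    """Summarize the documented effects of the optional parameters."""
    retained = str(params.get("retaincolumns", ""))
    return {
        # clustername defaults to clusterid when not supplied.
        "cluster_column": params.get("clustername", "clusterid"),
        # The progress report is always returned; the sample of scored
        # rows is returned only if samplescoresize is set.
        "result_sets": 2 if "samplescoresize" in params else 1,
        "fallback_requested": str(params.get("fallback", "")).lower() == "true",
        "retained_columns": [c for c in retained.split(",") if c],
    }


print(summarize_scoring_options({}))
print(summarize_scoring_options({
    "samplescoresize": "10",
    "retaincolumns": "city_name,state_code",
    "clustername": "mycluster",
    "fallback": "true",
}))
```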

Example

The following example assumes that the td_analyze function has been installed in a database named twm.

The cluster model in table cust_analysis_clusters is used to score the twm_customer_analysis table, producing the score table twm.cust_analysis_data. Several optional parameters are specified: operatordatabase, samplescoresize, retaincolumns, clustername, and fallback.

call twm.td_analyze('KmeansScore','database=twm_source;tablename=twm_customer_analysis;columns=avg_cc_bal,avg_ck_bal,avg_sv_bal;outscoredatabase=twm;outscoretable=cust_analysis_data;keycolumns=cust_id;inclusterdatabase=twm;inclustertable=cust_analysis_clusters;kvalue=3;operatordatabase=twm;samplescoresize=10;retaincolumns=city_name,state_code;clustername=mycluster;fallback=true;');