5.4.5 - Fast K-Means Cluster Scoring - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

prodname: Teradata Warehouse Miner
vrm_release: 5.4.5
created_date: February 2018
category: User Guide
featnum: B035-2306-028K

Purpose

After building a model using the Fast K-Means Clustering algorithm, new data can be scored using Fast K-Means Cluster Scoring. The first argument to td_analyze is the function name KmeansScore; the second is a string of semicolon-delimited cluster scoring parameters.

Fast K-Means Cluster Scoring returns one or two result sets. The first is a progress report with two columns: a timestamp and a progress message. The second is returned only if the samplescoresize parameter is set. It contains a sample of the rows in the output score table, with the number of rows determined by the value of samplescoresize.

Syntax

call twm.td_analyze('KmeansScore','database=twm_source;tablename=twm_customer_analysis;columns=col1,col2,col3;outscoredatabase=twm;outscoretable=table;keycolumns=col;inclusterdatabase=database;inclustertable=table;kvalue=number;Optional Parameters;');

Required Parameters

columns
The input columns used in clustering. The columns must reside in the table named with the tablename parameter, residing in the database named with the database parameter.
For example: columns=column1,column2,column3
database
The database containing the input table.
inclusterdatabase
The database containing the table that represents the cluster model to score.
inclustertable
The name of the input table containing the cluster model to score.
keycolumns
The names of one or more columns in the input table to use as the primary index of the scored output table.
kvalue
The number of clusters in the cluster model.
outscoredatabase
The database containing the resulting scored output table.
outscoretable
The name of the scored output table to build.
tablename
The name of the table containing the data to cluster.
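
As a minimal sketch, a call that supplies only the required parameters might look like the following. It assumes td_analyze is installed in a database named twm, and it reuses the database, table, and column names from the Example section purely as placeholders for your own objects.

```sql
-- Required parameters only; every optional parameter takes its default.
-- All object names are placeholders taken from the Example section.
call twm.td_analyze('KmeansScore','database=twm_source;tablename=twm_customer_analysis;columns=avg_cc_bal,avg_ck_bal,avg_sv_bal;outscoredatabase=twm;outscoretable=cust_analysis_data;keycolumns=cust_id;inclusterdatabase=twm;inclustertable=cust_analysis_clusters;kvalue=3;');
```

With the optional parameters omitted, the cluster identifier column takes its default name, clusterid. Because overwrite defaults to true, existing output tables are dropped first. Only the progress report result set is returned, since samplescoresize is not set.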

Optional Parameters

clustername
The name of the column representing the cluster identifier. The default is clusterid.
fallback
An optional flag. When set to true, the scored output table is created with the fallback attribute (that is, a mirrored copy is maintained).
operatordatabase
The database where the tda_kmeans table operator called by td_analyze resides. If not specified, the database software searches the standard search path for table operators, including the current user database.
For example: operatordatabase=twm
overwrite
When overwrite is set to true (the default), the output tables are dropped before new ones are created.
retaincolumns
A comma-separated list naming columns to include in the scored output table unchanged from their names and values in the input table to be scored.
samplescoresize
The number of rows of the output score table to return as a sample result set. If this parameter is not set, no sample result set is returned.

Example

This example assumes the td_analyze function is installed in a database named twm.

The model stored in table cust_analysis_clusters is used to score the twm_customer_analysis table, producing the score table twm.cust_analysis_data. Several optional parameters are specified: operatordatabase, samplescoresize, retaincolumns, clustername, and fallback.

call twm.td_analyze('KmeansScore','database=twm_source;tablename=twm_customer_analysis;columns=avg_cc_bal,avg_ck_bal,avg_sv_bal;outscoredatabase=twm;outscoretable=cust_analysis_data;keycolumns=cust_id;inclusterdatabase=twm;inclustertable=cust_analysis_clusters;kvalue=3;operatordatabase=twm;samplescoresize=10;retaincolumns=city_name,state_code;clustername=mycluster;fallback=true;');
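
After the call completes, one way to check the result is to count how many scored rows fall into each cluster. The query below is a sketch under an assumption not stated in this guide: that the score table exposes the cluster identifier in a column named mycluster, as requested with clustername=mycluster.

```sql
-- Count the scored rows assigned to each cluster.
-- Assumption: twm.cust_analysis_data has a cluster identifier column
-- named mycluster (set by clustername=mycluster in the call above).
select mycluster, count(*) as row_count
from twm.cust_analysis_data
group by mycluster
order by mycluster;
```

If clustername had been omitted from the call, the column to group by would presumably be the default, clusterid.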