Teradata Package for R Function Reference | 17.00 - td_kmeans_valib

Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Release Date: July 2021
Content Type: Programming Reference
Publication ID: B700-4007-090K
Language: English (United States)

Description

The function performs fast K-Means clustering and returns the mean and variance of each clustered column for every cluster. In the output, rows with positive cluster IDs contain the average value of each clustered column along with the count for that cluster ID. Rows with negative cluster IDs contain the variance of each clustered column for that cluster ID.
Note:

  • This function applies only to columns containing numeric data.
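
The output layout described above can be pictured with a small local data frame. This is an illustration only: the column names `clusterid` and `cnt`, all values, and the use of NA for the count on variance rows are hypothetical stand-ins, not actual output of td_kmeans_valib(). Pairing the variance row for cluster k with the ID -k is likewise an assumption.

```r
# Hypothetical stand-in for a locally collected result; column names and
# values are assumptions made for illustration only.
res <- data.frame(
  clusterid  = c(1, 2, 3, -1, -2, -3),
  avg_cc_bal = c(120.5, 980.0, 2500.3, 35.2, 410.9, 1200.7),
  avg_ck_bal = c(300.1, 150.4, 4200.8, 88.6, 52.3, 2100.0),
  cnt        = c(400, 250, 97, NA, NA, NA)
)

# Rows with positive cluster IDs: per-cluster averages plus the count.
cluster_means <- res[res$clusterid > 0, ]

# Rows with negative cluster IDs: per-cluster variances of the same columns.
cluster_vars <- res[res$clusterid < 0,
                    c("clusterid", "avg_cc_bal", "avg_ck_bal")]

# Assumption: the variance row for cluster k carries the ID -k.
cluster_vars$clusterid <- abs(cluster_vars$clusterid)

# One row per cluster, with means and variances side by side.
summary_tbl <- merge(cluster_means, cluster_vars, by = "clusterid",
                     suffixes = c("_mean", "_var"))
print(summary_tbl)
```

Splitting on the sign of the cluster ID keeps the two kinds of rows from being mixed up in later calculations, such as averaging a column across all rows.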

Usage

td_kmeans_valib(data, columns, centers, ...)

Arguments

data

Required Argument.
Specifies the input data on which K-Means clustering is to be performed.
Types: tbl_teradata

columns

Required Argument.
Specifies the name(s) of the column(s) to be used in clustering. Alternatively, a pre-defined string can be passed to select all columns or all numeric columns.
Permitted Values:

  1. Name(s) of the column(s) in "data".

  2. Pre-defined strings:

    1. 'all' - all columns

    2. 'allnumeric' - all numeric columns

Types: character OR vector of Strings (character)

centers

Required Argument.
Specifies the number of clusters to be contained in the cluster model.
Types: integer

...

Specifies other arguments supported by the function as described in the 'Other Arguments' section.

Value

The function returns an object of class "td_kmeans_valib", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Other Arguments

exclude.columns

Optional Argument.
Specifies the name(s) of the column(s) to exclude from the clustering analysis.
If 'all' or 'allnumeric' is used in the "columns" argument, this argument can be used to exclude specific columns from the analysis.
Types: character OR vector of Strings (character)
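
The interaction between "columns" and "exclude.columns" can be pictured as a set difference. The helper resolve_columns() below is hypothetical and runs on a local data frame with made-up values; it only mirrors the selection rules described above and is not how the package resolves columns internally.

```r
# Hypothetical helper: mimics the documented selection rules locally.
resolve_columns <- function(df, columns, exclude.columns = NULL) {
  selected <- if (identical(columns, "all")) {
    names(df)
  } else if (identical(columns, "allnumeric")) {
    names(df)[vapply(df, is.numeric, logical(1))]
  } else {
    columns
  }
  # Drop anything listed in exclude.columns from the selection.
  setdiff(selected, exclude.columns)
}

# Made-up local sample that borrows column names from the Examples section.
sample_df <- data.frame(
  cust_id    = c(1L, 2L, 3L),
  gender     = c("F", "M", "F"),
  avg_cc_bal = c(100.0, 250.5, 75.2),
  avg_ck_bal = c(900.0, 120.0, 430.9),
  stringsAsFactors = FALSE
)

# 'all' needs the non-numeric column excluded explicitly.
print(resolve_columns(sample_df, "all",
                      exclude.columns = c("cust_id", "gender")))

# 'allnumeric' already skips 'gender'; only the identifier is excluded.
print(resolve_columns(sample_df, "allnumeric", exclude.columns = "cust_id"))
```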

centroids.data

Optional Argument.
Specifies a tbl_teradata containing clustering output, which is used as the initial values for the clustering algorithm in place of random values. If this argument is not specified or is NULL, the function starts with random values.
Note:

  • If the argument "centroids.data" is specified, the "centroids.data" tbl_teradata object is overwritten by the result tbl_teradata of the function.

Types: tbl_teradata
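
Seeding a run with earlier centroids can be illustrated locally with base R's stats::kmeans(), whose "centers" argument accepts either a cluster count or a matrix of starting centroids. This is an analogy on synthetic data, not the Vantage Analytic Library implementation, and unlike "centroids.data" it does not overwrite anything.

```r
set.seed(42)

# Synthetic data: three well-separated groups in two dimensions.
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)

# First run: starting centroids are drawn at random from the rows of x.
fit1 <- kmeans(x, centers = 3, iter.max = 50)

# Second run: reuse the centroids of the first run as starting values.
fit2 <- kmeans(x, centers = fit1$centers, iter.max = 50)

# Compare the centroids of the two runs.
print(all.equal(unname(fit1$centers), unname(fit2$centers)))
```

Starting from centroids that have already converged should leave them essentially unchanged, which is why a seeded run is mainly useful for continuing a run that was stopped early or for refitting on updated data.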

max.iter

Optional Argument.
Specifies the maximum number of iterations to perform during clustering.
Default Value: 50
Types: integer

operator.database

Optional Argument.
Specifies the database where the table operators called by Vantage Analytic Library reside. If not specified, the library searches the standard search path for table operators, including the current database.
Types: character

threshold

Optional Argument.
Specifies the threshold used to decide whether the algorithm has converged, based on how much the cluster centroids change from one iteration to the next.
Default Value: 0.001
Types: numeric
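
The roles of "max.iter" and "threshold" can be sketched with a plain Lloyd-style loop in base R. This is a toy illustration on synthetic data, not the algorithm used by the Vantage Analytic Library. The name simple_kmeans() is hypothetical, and measuring centroid change as the largest Euclidean shift of any centroid is an assumption; the source does not say how the change is measured.

```r
# Hypothetical toy implementation; the convergence measure is an assumption.
simple_kmeans <- function(x, centroids, max.iter = 50, threshold = 0.001) {
  for (iter in seq_len(max.iter)) {
    # Squared distance from every row to every centroid (rows x clusters).
    d2 <- sapply(seq_len(nrow(centroids)), function(k) {
      rowSums(sweep(x, 2, centroids[k, ])^2)
    })
    # Assign each row to its nearest centroid.
    cluster <- max.col(-d2, ties.method = "first")

    # Recompute each centroid as the mean of its rows; keep empty ones as-is.
    updated <- centroids
    for (k in seq_len(nrow(centroids))) {
      if (any(cluster == k)) {
        updated[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
      }
    }

    # Stop once no centroid moved by the threshold or more.
    shift <- max(sqrt(rowSums((updated - centroids)^2)))
    centroids <- updated
    if (shift < threshold) break
  }
  list(centroids = centroids, cluster = cluster, iterations = iter)
}

set.seed(1)
# Synthetic data: two well-separated groups of 20 rows each.
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))

# Start from one row of each group.
init <- x[c(1, 21), ]

fit <- simple_kmeans(x, init, max.iter = 50, threshold = 0.001)
print(fit$iterations)
print(fit$centroids)
```

A looser threshold or a smaller "max.iter" trades accuracy for speed: the loop ends earlier, possibly before the centroids have settled.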

Examples

# Notes:
#   1. To execute Vantage Analytic Library functions, set option 
#      'val.install.location' to the database name where Vantage analytic 
#      library functions are installed.
#   2. Datasets used in these examples can be loaded using Vantage Analytic 
#      Library installer.

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Get remote data source connection.
con <- td_get_context()$connection

# Create an object of class "tbl_teradata".
custanly <- tbl(con, "customer_analysis")
print(custanly)

# Example 1: Run KMeans clustering on the tbl_teradata 'customer_analysis' 
#            with initial random values. The call passes 'all' for the 
#            "columns" argument and uses "exclude.columns" to drop cust_id 
#            and the non-numeric columns.
obj <- td_kmeans_valib(data=custanly,
                       columns='all',
                       exclude.columns=c("cust_id", "gender", "marital_status",
                                         "city_name", "state_code"),
                       centers=3)

# Print the results.
print(obj$result)

# Example 2: Run KMeans clustering on the tbl_teradata 'customer_analysis' with 
#            pre-existing result tbl_teradata in "centroids.data" argument.
obj <- td_kmeans_valib(data=custanly,
                       columns=c("avg_cc_bal", "avg_ck_bal", "avg_sv_bal"),
                       centers=3,
                       max.iter=5,
                       threshold=0.1)
# Use KMeans result tbl_teradata (from above step) in "centroids.data" argument.
obj1 <- td_kmeans_valib(data=custanly,
                        columns=c("avg_cc_bal", "avg_ck_bal", "avg_sv_bal"),
                        centers=3,
                        max.iter=5,
                        threshold=0.1,
                        centroids.data=obj$result)
# Print the results.
print(obj1$result)