Description
The function runs the fast K-Means clustering algorithm and returns cluster
means and variances. Specifically, the output rows with positive cluster
IDs contain the average value of each clustered column along with the
count for each cluster ID. The output rows with negative cluster IDs
contain the variance of each clustered column for each cluster ID.
Note:
This function applies only to columns containing numeric data.
Usage
td_kmeans_valib(data, columns, centers, ...)
Arguments
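data |
Required Argument.
Specifies the input tbl_teradata to cluster.
Types: tbl_teradata |
columns |
Required Argument.
Specifies the name(s) of the column(s) to cluster. The strings 'all' and 'allnumeric' are also accepted.
Types: character OR vector of Strings (character) |
centers |
Required Argument.
Specifies the number of clusters to produce.
Types: integer |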
... |
Specifies other arguments supported by the function as described in the 'Other Arguments' section. |
Value
The function returns an object of class "td_kmeans_valib",
which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
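As a rough illustration of the result layout given in the Description, the sketch below builds a small local data.frame that mimics the output and separates the mean rows from the variance rows. The column names (clusterid, avg_cc_bal, avg_ck_bal, cnt) and all values are made up for illustration; they are assumptions, not the function's documented output schema, and it is likewise assumed that variance rows carry no count.

```r
# Mock of the result layout (hypothetical column names and values).
# Positive cluster IDs: per-column averages plus the cluster count.
# Negative cluster IDs: per-column variances for the matching cluster.
mock_result <- data.frame(
  clusterid  = c(1, 2, -1, -2),
  avg_cc_bal = c(120.5, 980.0, 35.2, 410.7),
  avg_ck_bal = c(300.0, 1500.0, 80.1, 655.3),
  cnt        = c(40, 25, NA, NA)
)

# Rows holding cluster means and counts.
cluster_means <- mock_result[mock_result$clusterid > 0, ]

# Rows holding cluster variances; flip the sign to recover the cluster ID.
cluster_vars <- mock_result[mock_result$clusterid < 0, ]
cluster_vars$clusterid <- -cluster_vars$clusterid
cluster_vars$cnt <- NULL

# Join means and variances side by side, one row per cluster.
summary_by_cluster <- merge(cluster_means, cluster_vars,
                            by = "clusterid",
                            suffixes = c("_mean", "_var"))
print(summary_by_cluster)
```

With a real result, the same split would presumably be applied to obj$result after checking its actual column names.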
Other Arguments
exclude.columns
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the
clustering analysis.
If 'all' or 'allnumeric' is used in the "columns" argument,
this argument can be used to exclude specific columns from the
analysis.
Types: character OR vector of Strings (character)
centroids.data
Optional Argument.
Specifies the tbl_teradata containing
clustering output to use as the initial values
for the clustering algorithm instead of random
values. If this argument is not specified or is NULL,
the function starts with random values.
Note:
If "centroids.data" is specified, the tbl_teradata passed in it is overwritten by the result tbl_teradata of this function call.
Types: tbl_teradata
max.iter
Optional Argument.
Specifies the maximum number of iterations to perform
during clustering.
Default Value: 50
Types: integer
operator.database
Optional Argument.
Specifies the database where the table operators
called by Vantage Analytic Library reside. If not
specified, the library searches the standard
search path for table operators, including the
current database.
Types: character
threshold
Optional Argument.
Specifies the value which determines if the algorithm
has converged based on how much the cluster centroids
change from one iteration to the next.
Default Value: 0.001
Types: numeric
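To make the roles of "centroids.data", "max.iter" and "threshold" more concrete, here is a minimal local sketch of a Lloyd-style K-Means loop in base R. It is not the Vantage Analytic Library implementation. The function name simple_kmeans and its init parameter (standing in for "centroids.data") are hypothetical, and the convergence test used here (largest absolute change in any centroid coordinate) is an assumption, since this page does not state how the change between iterations is measured.

```r
# Minimal Lloyd-style K-Means in base R (illustrative only; not the
# Vantage Analytic Library implementation).
simple_kmeans <- function(x, centers, init = NULL,
                          max.iter = 50, threshold = 0.001) {
  x <- as.matrix(x)
  # Start from supplied centroids if given, otherwise from random rows.
  centroids <- if (is.null(init)) {
    x[sample(nrow(x), centers), , drop = FALSE]
  } else {
    as.matrix(init)
  }
  for (iter in seq_len(max.iter)) {
    # Squared distance from every row to every centroid.
    d <- sapply(seq_len(nrow(centroids)), function(k) {
      rowSums(sweep(x, 2, centroids[k, ])^2)
    })
    cluster <- max.col(-d, ties.method = "first")
    # Recompute each centroid; keep the old one if a cluster is empty.
    updated <- centroids
    for (k in seq_len(nrow(centroids))) {
      members <- x[cluster == k, , drop = FALSE]
      if (nrow(members) > 0) updated[k, ] <- colMeans(members)
    }
    # Stop once no centroid coordinate moved by the threshold or more.
    shift <- max(abs(updated - centroids))
    centroids <- updated
    if (shift < threshold) break
  }
  list(centroids = centroids, cluster = cluster, iterations = iter)
}

# Two well-separated synthetic groups of 20 points each.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))

# Cold start from random rows.
fit <- simple_kmeans(x, centers = 2, max.iter = 50, threshold = 0.001)

# Warm start: reuse the previous centroids as the initial values.
refit <- simple_kmeans(x, centers = 2, init = fit$centroids)
```

A larger threshold or a smaller max.iter stops the loop earlier at the cost of less settled centroids, which is the trade-off Example 2 exercises with max.iter=5 and threshold=0.1.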
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set the option
# 'val.install.location' to the name of the database where the Vantage
# Analytic Library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic
# Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
custanly <- tbl(con, "customer_analysis")
print(custanly)
# Example 1: Run KMeans clustering on the tbl_teradata 'customer_analysis'
# with initial random values. The function uses 'all' for "columns"
# argument and excludes all non-numeric columns.
obj <- td_kmeans_valib(data=custanly,
                       columns='all',
                       exclude.columns=c("cust_id", "gender", "marital_status",
                                         "city_name", "state_code"),
                       centers=3)
# Print the results.
print(obj$result)
# Example 2: Run KMeans clustering on the tbl_teradata 'customer_analysis' with
# pre-existing result tbl_teradata in "centroids.data" argument.
obj <- td_kmeans_valib(data=custanly,
                       columns=c("avg_cc_bal", "avg_ck_bal", "avg_sv_bal"),
                       centers=3,
                       max.iter=5,
                       threshold=0.1)
# Use KMeans result tbl_teradata (from above step) in "centroids.data" argument.
obj1 <- td_kmeans_valib(data=custanly,
                        columns=c("avg_cc_bal", "avg_ck_bal", "avg_sv_bal"),
                        centers=3,
                        max.iter=5,
                        threshold=0.1,
                        centroids.data=obj$result)
# Print the results.
print(obj1$result)