Teradata Package for R Function Reference | 17.20 - KMeans - Teradata Package for R

Teradata® Package for R Function Reference

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for R
Release Number: 17.20
Published: March 2024
Last Edition: 2024-05-03
Product Category: Teradata Vantage

KMeans

Description

The td_kmeans_sqle() function groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm minimizes the objective function, that is, the total Euclidean distance of all data points from their cluster centers, as follows:

  1. Specify or randomly select k initial cluster centroids.

  2. Assign each data point to the cluster that has the closest centroid.

  3. Recalculate the positions of the k centroids.

  4. Repeat steps 2 and 3 until the centroids no longer move.
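The four steps above can be sketched locally in base R. This is a minimal illustration, not the SQL Engine implementation: the function lloyd_kmeans(), its argument names, and the stopping rule (largest centroid movement below a threshold) are assumptions made for this sketch.

```r
# Minimal base-R sketch of the four steps (hypothetical helper, not tdplyr code).
lloyd_kmeans <- function(x, k, iter_max = 10, threshold = 0.0395, seed = 1) {
  x <- as.matrix(x)
  set.seed(seed)
  # Step 1: randomly select k rows as the initial centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(iter_max)) {
    # Step 2: assign each point to the centroid with the smallest
    # squared Euclidean distance.
    d <- vapply(seq_len(k),
                function(j) rowSums(sweep(x, 2, centroids[j, ])^2),
                numeric(nrow(x)))
    cluster <- max.col(-d, ties.method = "first")
    # Step 3: recompute each centroid as the mean of its assigned points.
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 4: stop once no centroid moves by more than the threshold.
    shift <- max(sqrt(rowSums((new_centroids - centroids)^2)))
    centroids <- new_centroids
    if (shift < threshold) break
  }
  list(centroids = centroids, cluster = cluster, iterations = iter)
}

# Usage on two well-separated synthetic groups of three points each.
x <- rbind(cbind(c(0, 0.1, 0.2), c(0, 0.1, 0)),
           cbind(c(10, 10.1, 10.2), c(10, 10.1, 10)))
fit <- lloyd_kmeans(x, k = 2, iter_max = 20)
print(fit$cluster)
```

The sketch keeps an empty cluster's centroid where it was, which is one of several possible conventions.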

The algorithm does not necessarily find the optimal configuration, because the result depends significantly on the randomly selected initial cluster centers.
The user can run the function multiple times to reduce the effect of this limitation.

The function also returns the within-cluster squared sum, which the user can use to determine an optimal number of clusters with the Elbow method.
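As a rough local illustration of the Elbow method, the sketch below uses base R's stats::kmeans() as a stand-in for td_kmeans_sqle() and synthetic data with three groups. The data, the range of k, and the use of tot.withinss are assumptions of this sketch; with td_kmeans_sqle(), the corresponding value would be read from the returned model instead.

```r
# Elbow-method sketch with base R's stats::kmeans() on synthetic data.
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for each candidate number of clusters.
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is the k after which adding clusters stops reducing wss sharply.
print(data.frame(k = ks, total_wss = round(wss, 1)))
```

Plotting total_wss against k and picking the point where the curve flattens is the usual way to read the result.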
Notes:

  • This function ignores the "data" and "centroids.data" input rows that have a NULL entry in the specified "target.columns".

  • The function can produce deterministic output across different machine configurations if the user provides "centroids.data".

  • If "centroids.data" is not provided, the function randomly samples the initial centroids from "data". In this case, using the "seed" argument makes the function output deterministic on a machine with a given configuration. However, the "seed" argument does not guarantee deterministic output across machines with different configurations.

  • This function requires the UTF8 client character set for UNICODE data.

  • This function does not support Pass Through Characters (PTCs).

  • For information about PTCs, see Teradata Vantage™ - Analytics Database International Character Set Support.

  • This function does not support KanjiSJIS or Graphic data types.
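To preview which rows the first note excludes, the same rule can be mimicked locally in base R. In this sketch NA stands in for a database NULL, and the data frame and its column names are made up for illustration.

```r
# Rows with a missing value in any target column are dropped before clustering.
df <- data.frame(
  id    = 1:5,
  price = c(1499, NA, 1795, 1595, 1849),
  speed = c(25, 33, NA, 25, 33),
  note  = c(NA, "a", "b", "c", "d")  # not a target column, so its NA does not matter
)
target_columns <- c("price", "speed")

# Keep only rows that are complete in the target columns.
usable <- df[complete.cases(df[, target_columns, drop = FALSE]), ]
print(usable$id)  # rows 1, 4 and 5 remain
```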

Usage

  td_kmeans_sqle (
      data = NULL,
      centroids.data = NULL,
      id.column = NULL,
      target.columns = NULL,
      num.clusters = NULL,
      seed = NULL,
      threshold = 0.0395,
      iter.max = 10,
      num.init = 1,
      output.cluster.assignment = FALSE,
      ...
  )

Arguments

data

Required Argument.
Specifies the input tbl_teradata.
Types: tbl_teradata

centroids.data

Optional Argument.
Specifies the input tbl_teradata containing the set of initial centroids.
Types: tbl_teradata

id.column

Required Argument.
Specifies the input data column name that has the
unique identifier for each row in the input.
Types: character

target.columns

Required Argument.
Specifies the name(s) of the column(s) in "data" for clustering.
Types: character OR vector of Strings (character)

num.clusters

Optional Argument.
Specifies the number of clusters to be created.
Note:
This argument is not required if "centroids.data" is provided.
Types: integer

seed

Optional Argument.
Specifies a non-negative integer value used to randomly select the initial
cluster centroid positions from the input.
Note:

  • This argument is not required if "centroids.data" is provided.

  • If "seed" is not passed, a random integer value is used.

Types: integer

threshold

Optional Argument.
Specifies the convergence threshold. The algorithm converges if the distance
between the centroids from the previous iteration and the current iteration is less than the specified value.
Default Value: 0.0395
Types: float OR integer

iter.max

Optional Argument.
Specifies the maximum number of iterations for the K-means algorithm.
The algorithm stops after performing the specified number of iterations even if the convergence criterion is not met.
Default Value: 10
Types: integer

num.init

Optional Argument.
Specifies the number of times the k-means algorithm is run with different
initial centroid seeds. The function returns the model with the lowest Total Within Cluster Squared Sum.
Note:
This argument is not required if "centroids.data" is provided.
Default Value: 1
Types: integer
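The idea behind "num.init" can be sketched locally: run k-means several times from different random starting centroids and keep the run with the lowest total within-cluster sum of squares. The sketch uses base R's stats::kmeans() as a stand-in; best_of_runs() and its arguments are hypothetical names, not part of tdplyr.

```r
# Keep the best of several randomly initialized k-means runs.
best_of_runs <- function(x, k, num_init = 5, seed = 1) {
  set.seed(seed)
  best <- NULL
  for (i in seq_len(num_init)) {
    # Each call draws its own random initial centroids.
    fit <- kmeans(x, centers = k, nstart = 1)
    if (is.null(best) || fit$tot.withinss < best$tot.withinss) best <- fit
  }
  best
}

# Synthetic data with three groups.
set.seed(7)
pts <- rbind(
  matrix(rnorm(60, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 1), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 1), ncol = 2)
)

single <- best_of_runs(pts, k = 3, num_init = 1)
multi  <- best_of_runs(pts, k = 3, num_init = 10)
print(c(single = single$tot.withinss, multi = multi$tot.withinss))
```

Because both calls start from the same seed, the first run is identical in each, so the ten-run result can never be worse than the single-run result.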

output.cluster.assignment

Optional Argument.
Specifies whether to output Cluster Assignment information.
Default Value: FALSE
Types: logical

...

Specifies the generic keyword arguments that SQLE functions accept.
The generic keyword arguments are:

persist:
Optional Argument.
Specifies whether to persist the results of the
function in a table or not. When set to TRUE, results are persisted in a table; otherwise, results are garbage collected at the end of the session.
Default Value: FALSE
Types: logical

volatile:
Optional Argument.
Specifies whether to put the results of the
function in a volatile table or not. When set to TRUE, results are stored in a volatile table; otherwise, they are not.
Default Value: FALSE
Types: logical

The function allows the user to partition, hash, order, or local order the input data. These generic arguments are available for each argument that accepts a tbl_teradata as input and can be accessed as:

  • "<input.data.arg.name>.partition.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.hash.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.order.column" accepts character or vector of character (Strings)

  • "local.order.<input.data.arg.name>" accepts logical

Note:
These generic arguments are supported by tdplyr if the underlying SQL Engine function supports them; otherwise, an exception is raised.

Value

The function returns an object of class "td_kmeans_sqle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the names:

  1. result

  2. model.data

Examples

  
    
    # Get the current context/connection.
    con <- td_get_context()$connection
    
    # Load the example data.
    loadExampleData("kmeans_example", "computers_train1")
    
    # Create tbl_teradata object.
    computers_train1 <- tbl(con, "computers_train1")
    
    # Check the list of available analytic functions.
    display_analytic_functions()
    
    # Example 1 : Grouping a set of observations into 2 clusters in which
    #             each observation belongs to the cluster with the nearest mean.
    KMeans_out <- td_kmeans_sqle(id.column="id",
                                 target.columns=c('price', 'speed'),
                                 data=computers_train1,
                                 num.clusters=2)
    
    # Print the result tbl_teradata objects.
    print(KMeans_out$result)
    print(KMeans_out$model.data)
    
    # Example 2 : Grouping a set of observations by specifying initial
    #             centroid data.
    
    # Get the set of initial centroids by selecting a group of rows
    # from the input data. The ids used here are illustrative.
    kmeans_initial_centroids_table <- computers_train1 %>%
                                          filter(id %in% c(19, 97))
    
    KMeans_out_1 <- td_kmeans_sqle(id.column="id",
                                   target.columns=c('price', 'speed'),
                                   data=computers_train1,
                                   centroids.data=kmeans_initial_centroids_table)
    
    # Print the result tbl_teradata objects.
    print(KMeans_out_1$result)
    print(KMeans_out_1$model.data)