Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
dita:id: B700-4007
NMT: no
Product Category: Teradata Vantage

KMeans

Description

The KMeans function takes a data set and outputs the centroids of its clusters and, optionally, the clusters themselves. The algorithm groups a set of observations into k clusters with each observation assigned to the cluster with the nearest centroid, or mean. The algorithm minimizes an objective function; in the KMeans function, the objective function is the total Euclidean distance of all data points from the center of the cluster to which they are assigned.
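
To make the objective concrete, here is a minimal base-R sketch that assigns each point to its nearest centroid and sums the resulting Euclidean distances. It is an illustration only, not the Teradata implementation; the names `total_distance`, `points`, and `centroids` are hypothetical, and the four sample points are made up for the demonstration.

```r
# Illustrative sketch only; not how td_kmeans_mle is implemented.
# `total_distance`, `points`, and `centroids` are hypothetical names.
total_distance <- function(points, centroids) {
  # d[i, j] = Euclidean distance from point i to centroid j.
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    sqrt(rowSums(sweep(points, 2, centroids[j, ])^2))
  })
  # Index of the nearest centroid for every point.
  assignment <- apply(d, 1, which.min)
  # Objective: sum of each point's distance to its assigned centroid.
  objective <- sum(d[cbind(seq_len(nrow(points)), assignment)])
  list(assignment = assignment, objective = objective)
}

# Made-up sample data: two tight groups of two points each.
points <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
centroids <- rbind(c(0, 1), c(10, 1))

res <- total_distance(points, centroids)
res$assignment  # 1 1 2 2
res$objective   # 4
```

Each point lies exactly one unit from its nearest centroid, so the total is 4.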

Usage

  td_kmeans_mle (
      data = NULL,
      centers = NULL,
      iter.max = 10,
      initial.seeds = NULL,
      seed = NULL,
      unpack.columns = FALSE,
      centroids.table = NULL,
      threshold = 0.0395,
      data.sequence.column = NULL,
      centroids.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata that contains the features used to cluster the data.

centers

Optional Argument.
Specifies the number of clusters to generate from the data.
Note: When "centers" is specified, the function uses a nondeterministic algorithm and supports up to 1543 dimensions.
Types: integer

iter.max

Optional Argument.
Specifies the maximum number of iterations that the algorithm runs before quitting if the convergence threshold has not been met.
Default Value: 10
Types: integer

initial.seeds

Optional Argument.
Specifies the initial seed means as strings of underscore-delimited numeric values. For example, this value initializes eight clusters in eight-dimensional space: c("50_50_50_50_50_50_50_50", "150_150_150_150_150_150_150_150", "250_250_250_250_250_250_250_250", "350_350_350_350_350_350_350_350", "450_450_450_450_450_450_450_450", "550_550_550_550_550_550_550_550", "650_650_650_650_650_650_650_650", "750_750_750_750_750_750_750_750").
The dimensionality of the means must match the dimensionality of the data: each mean must contain n numbers, where n is the number of input columns minus one.
By default, the algorithm chooses the initial seed means randomly.
Note: When "initial.seeds" is specified, the function uses a deterministic algorithm and supports up to 1596 dimensions.
Types: character OR vector of characters
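
When the seed means already exist as numbers, the underscore-delimited strings can be built rather than typed by hand. The sketch below is one possible approach in base R; `seed_means` and `initial_seeds` are hypothetical names, and the two rows reuse the first two seeds from Example 3 in the Examples section.

```r
# Hypothetical seed matrix: one row per cluster, one column per feature.
# The values are the first two seeds from Example 3.
seed_means <- rbind(c(2249, 51, 408, 8, 14),
                    c(2165, 51, 398, 7, 14.6))

# Join each row into one underscore-delimited string.
initial_seeds <- apply(seed_means, 1, paste, collapse = "_")
initial_seeds
# "2249_51_408_8_14" "2165_51_398_7_14.6"

# Sanity check: every string carries one number per feature column.
n_values <- lengths(strsplit(initial_seeds, "_", fixed = TRUE))
stopifnot(all(n_values == ncol(seed_means)))
```

The resulting character vector has the shape that "initial.seeds" expects. Values that R would render in scientific notation (for example 1e+05) would need explicit formatting before being joined.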

seed

Optional Argument.
Specifies a random seed for the algorithm.
Types: integer

unpack.columns

Optional Argument.
Specifies whether the means for each centroid appear unpacked (that is, in separate columns) in the "clusters.centroids" output tbl_teradata. By default, the function concatenates the means for the centroids and outputs the result in a single VARCHAR column.
Default Value: FALSE
Types: logical

centroids.table

Optional Argument.
Specifies the tbl_teradata that contains the initial seed means for the clusters. The schema of the "centroids.table" tbl_teradata depends on the value of the "unpack.columns" argument.
Note: When "centroids.table" is specified, the function uses a deterministic algorithm and supports up to 1596 dimensions.

threshold

Optional Argument.
Specifies the convergence threshold. When the centroids move by less than this amount, the algorithm has converged.
Default Value: 0.0395
Types: numeric
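
The interaction between "iter.max" and "threshold" can be pictured with a small base-R loop. This is a generic k-means sketch under stated assumptions, not the Teradata implementation: the function name `lloyd_sketch` and its arguments are hypothetical, and measuring movement as the largest distance any single centroid travels in one iteration is an assumption, since this page does not say how movement is measured.

```r
# Generic k-means stopping-rule sketch; not how td_kmeans_mle is implemented.
# `lloyd_sketch`, `max_iter`, and `shift` are hypothetical names.
lloyd_sketch <- function(points, centroids, max_iter = 10, threshold = 0.0395) {
  for (iter in seq_len(max_iter)) {
    # Assign every point to its nearest centroid.
    d <- sapply(seq_len(nrow(centroids)), function(j) {
      sqrt(rowSums(sweep(points, 2, centroids[j, ])^2))
    })
    assignment <- apply(d, 1, which.min)

    # Move each centroid to the mean of its points; empty clusters stay put.
    updated <- centroids
    for (j in seq_len(nrow(centroids))) {
      members <- points[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) updated[j, ] <- colMeans(members)
    }

    # Assumption: movement = largest distance moved by any one centroid.
    shift <- max(sqrt(rowSums((updated - centroids)^2)))
    centroids <- updated
    if (shift < threshold) {
      return(list(centroids = centroids, iterations = iter, converged = TRUE))
    }
  }
  # Iteration budget exhausted before the threshold was met.
  list(centroids = centroids, iterations = max_iter, converged = FALSE)
}

# Made-up sample data and starting centroids.
points <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
start <- rbind(c(1, 1), c(9, 1))

fit <- lloyd_sketch(points, start)
capped <- lloyd_sketch(points, start, max_iter = 1)
```

With these inputs the centroids move by 1 in the first iteration and by 0 in the second, so `fit` converges after two iterations, while `capped` stops at the iteration limit without meeting the threshold.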

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

centroids.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "centroids.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_kmeans_mle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

  1. clusters.centroids

  2. clustered.output

  3. output
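
As a quick illustration of the "$" access pattern, the sketch below uses a stand-in object so that it runs without a Vantage connection. The stand-in is hypothetical: a real result holds tbl_teradata objects, whereas the empty data frames here are placeholders that only mirror the member names listed above.

```r
# Stand-in for a td_kmeans_mle result (hypothetical; placeholders only).
# A real result returned by td_kmeans_mle() holds tbl_teradata objects.
result <- structure(
  list(clusters.centroids = data.frame(),
       clustered.output   = data.frame(),
       output             = data.frame()),
  class = "td_kmeans_mle"
)

# List the available members, then pull one out by name.
member_names <- names(result)
member_names
centroids <- result$clusters.centroids
```

In a live session the same `$` access would be applied to the object returned by td_kmeans_mle(), for example td_kmeans_out1 from Example 1.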

Examples

  
    # Get the current context/connection.
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("kmeans_example", "computers_train1")
    
    # Create object(s) of class "tbl_teradata".
    computers_train1 <- tbl(con, "computers_train1")
    
    # These examples use different arguments to find clusters based on the five
    # attributes of personal computers data in the input tbl_teradata.
    
    # Example 1 - Using "centers" to specify the number of clusters to generate.
    td_kmeans_out1 <- td_kmeans_mle(data = computers_train1,
                                    centers = 8,
                                    iter.max = 10,
                                    threshold = 0.05
                                    )
    
    # Example 2 - Using "centers" to specify the number of clusters to generate, and
    # setting "unpack.columns" to TRUE so that the centroids appear unpacked in
    # the "clusters.centroids" output tbl_teradata.
    td_kmeans_out2 <- td_kmeans_mle(data = computers_train1,
                                    centers = 8,
                                    iter.max = 10,
                                    unpack.columns = TRUE,
                                    threshold = 0.05
                                    )
    
    # Example 3 - Using "initial.seeds" to specify the initial seed means.
    td_kmeans_out3 <- td_kmeans_mle(data = computers_train1,
                                    initial.seeds = c("2249_51_408_8_14",
                                                      "2165_51_398_7_14.6",
                                                      "2182_51_404_7_14.6",
                                                      "2204_55_372_7.19_14.6",
                                                      "2419_44_222_6.6_14.3",
                                                      "2394_44.3_277_7.3_14.5",
                                                      "2326_43.6_301_7.11_14.3",
                                                      "2288_44_325_7_14.4")
                                    )