Teradata R Package Function Reference - 16.20 - KMeans - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 16.20
created_date: February 2020
category: Programming Reference
featnum: B700-4007-098K

Description

The KMeans function takes a data set and outputs the centroids of its clusters and, optionally, the clusters themselves.

Usage

  td_kmeans_mle (
      data = NULL,
      centers = NULL,
      iter.max = 10,
      initial.seeds = NULL,
      seed = NULL,
      unpack.columns = FALSE,
      centroids.table = NULL,
      threshold = 0.0395,
      data.sequence.column = NULL,
      centroids.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the input dataset that contains the features used to cluster the data.

centers

Optional Argument.
Specifies the number of clusters to generate from the data.
Note: With centers, the function uses a nondeterministic algorithm and supports up to 1543 dimensions.

iter.max

Optional Argument.
Specifies the maximum number of iterations that the algorithm runs before quitting if the convergence threshold has not been met.
Default Value: 10

initial.seeds

Optional Argument.
Specifies the initial seed means as strings of underscore-delimited DOUBLE PRECISION values. For example, this value initializes eight clusters in eight-dimensional space:

  initial.seeds = c("50_50_50_50_50_50_50_50",
                    "150_150_150_150_150_150_150_150",
                    "250_250_250_250_250_250_250_250",
                    "350_350_350_350_350_350_350_350",
                    "450_450_450_450_450_450_450_450",
                    "550_550_550_550_550_550_550_550",
                    "650_650_650_650_650_650_650_650",
                    "750_750_750_750_750_750_750_750")

The dimensionality of the means must match the dimensionality of the data (that is, each mean must have n numbers in it, where n is the number of input columns minus one). By default, the algorithm chooses the initial seed means randomly.
Note: With initial.seeds, the function uses a deterministic algorithm and supports up to 1596 dimensions.
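
Seed strings written by hand are easy to get wrong, so one option is to build them from a numeric matrix. The sketch below uses only base R; the names seed_means and initial_seeds are illustrative and not part of the package, and the values are the first two seeds from Example 3.

```r
# One row per cluster, one column per dimension (illustrative values).
seed_means <- matrix(
  c(2249, 51, 408, 8, 14,
    2165, 51, 398, 7, 14.6),
  ncol = 5, byrow = TRUE
)

# Join each row into an underscore-delimited string.
initial_seeds <- apply(seed_means, 1, function(row) paste(row, collapse = "_"))

# Every seed must carry as many values as the data has dimensions.
n_values <- lengths(strsplit(initial_seeds, "_", fixed = TRUE))
stopifnot(all(n_values == ncol(seed_means)))

print(initial_seeds)
```

The resulting character vector can then be passed as the initial.seeds argument.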

seed

Optional Argument.
Sets a random seed for the algorithm.
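
The effect of fixing a seed can be pictured with a local analogy in base R. This is not the in-database algorithm: pick_initial_rows is a hypothetical helper that draws starting centroids at random from the data, and it only shows that the same seed reproduces the same draw.

```r
# Hypothetical helper: choose `centers` rows of x at random as starting centroids.
pick_initial_rows <- function(x, centers, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x[sample(nrow(x), centers), , drop = FALSE]
}

x <- matrix(1:20, ncol = 2)

# The same seed yields the same starting centroids on every call.
first  <- pick_initial_rows(x, centers = 3, seed = 42)
second <- pick_initial_rows(x, centers = 3, seed = 42)
print(identical(first, second))
```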

unpack.columns

Optional Argument.
Specifies whether the means for each centroid appear unpacked (that is, in separate columns) in the output table. By default, the function concatenates the means for the centroids and outputs the result in a single VARCHAR column.
Default Value: FALSE

centroids.table

Optional Argument.
Specifies the input dataset that contains the initial seed means for the clusters. The schema of the centroids table depends on the value of the unpack.columns argument.
Note: With centroids.table, the function uses a deterministic algorithm and supports up to 1596 dimensions.
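
The dimension limits quoted in the notes for centers, initial.seeds, and centroids.table can be checked locally before calling the function. The helper below is hypothetical (check_dimensions is not part of the package) and assumes, as described for initial.seeds, that the number of dimensions is the number of input columns minus one.

```r
# Hypothetical helper: compare the data's dimensionality with the documented limit.
check_dimensions <- function(n_input_columns,
                             mode = c("centers", "initial.seeds", "centroids.table")) {
  mode <- match.arg(mode)
  # 1543 dimensions with centers; 1596 with initial.seeds or centroids.table.
  limit <- if (mode == "centers") 1543 else 1596
  n_dims <- n_input_columns - 1
  list(dimensions = n_dims, limit = limit, ok = n_dims <= limit)
}

# Six input columns give five dimensions, well within the limit.
print(check_dimensions(6, "centers")$ok)

# 1600 input columns give 1599 dimensions, which exceeds the limit.
print(check_dimensions(1600, "initial.seeds")$ok)
```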

threshold

Optional Argument.
Specifies the convergence threshold. When the centroids move by less than this amount, the algorithm has converged.
Default Value: 0.0395
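
To show how iter.max and threshold interact, here is a minimal local k-means loop in base R. It is a sketch, not the in-database implementation: kmeans_sketch is a hypothetical function, and measuring centroid movement as the largest Euclidean shift in an iteration is an assumption.

```r
# Hypothetical local sketch of the stopping rule; not the Teradata implementation.
kmeans_sketch <- function(x, centers, iter.max = 10, threshold = 0.0395) {
  x <- as.matrix(x)
  converged <- FALSE
  iterations <- 0
  assignment <- integer(nrow(x))
  for (i in seq_len(iter.max)) {
    iterations <- i
    # Squared Euclidean distance from every row to every centroid.
    d <- matrix(0, nrow(x), nrow(centers))
    for (k in seq_len(nrow(centers))) {
      d[, k] <- rowSums(sweep(x, 2, centers[k, ])^2)
    }
    # Assign each row to its nearest centroid.
    assignment <- max.col(-d, ties.method = "first")
    # Recompute each centroid as the mean of its members.
    new_centers <- centers
    for (k in seq_len(nrow(centers))) {
      members <- x[assignment == k, , drop = FALSE]
      if (nrow(members) > 0) new_centers[k, ] <- colMeans(members)
    }
    # Largest distance moved by any centroid (assumed convergence measure).
    shift <- max(sqrt(rowSums((new_centers - centers)^2)))
    centers <- new_centers
    if (shift < threshold) {
      converged <- TRUE
      break
    }
  }
  list(centers = centers, cluster = assignment,
       iterations = iterations, converged = converged)
}

x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
start <- rbind(c(0, 0), c(10, 10))

fit <- kmeans_sketch(x, start)                  # stops once the centroids stop moving
capped <- kmeans_sketch(x, start, iter.max = 1) # stops at the iteration cap instead
print(c(fit$iterations, fit$converged, capped$converged))
```

In this toy data the centroids move by 0.5 in the first pass and not at all in the second, so the default run converges on iteration 2, while the capped run quits before the threshold is met.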

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

centroids.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "centroids.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_kmeans_mle", which is a named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator using the following names:

  1. clusters.centroids

  2. clustered.output

  3. output

Examples

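    # Load the packages (assumes tdplyr and dplyr are installed
    # and a Teradata context has already been created).
    library(tdplyr)
    library(dplyr)
    
    # Get the current context/connection.
    con <- td_get_context()$connection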
    
    # Load example data.
    loadExampleData("kmeans_example", "computers_train1")
    
    # Create remote tibble objects.
    computers_train1 <- tbl(con, "computers_train1")
    
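    # Example 1 - Generate 8 clusters from randomly chosen initial seeds.
    # The centroid means are returned packed in a single column (the default).
    td_kmeans_out1 <- td_kmeans_mle(data = computers_train1,
                                    centers = 8,
                                    iter.max = 10,
                                    threshold = 0.05)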
    
    # Example 2 - Same as Example 1, but with the centroid means
    # unpacked into separate columns.
    td_kmeans_out2 <- td_kmeans_mle(data = computers_train1,
                                    centers = 8,
                                    iter.max = 10,
                                    unpack.columns = TRUE,
                                    threshold = 0.05)
    
    # Example 3 - Supply eight explicit initial seed means
    # (five underscore-delimited values each) instead of centers.
    td_kmeans_out3 <- td_kmeans_mle(data = computers_train1,
                                    initial.seeds = c("2249_51_408_8_14",
                                                      "2165_51_398_7_14.6",
                                                      "2182_51_404_7_14.6",
                                                      "2204_55_372_7.19_14.6",
                                                      "2419_44_222_6.6_14.3",
                                                      "2394_44.3_277_7.3_14.5",
                                                      "2326_43.6_301_7.11_14.3",
                                                      "2288_44_325_7_14.4"))