Teradata R Package
February 2020
Teradata Vantage


The KNN function uses training data objects to map test data objects to categories. The function is optimized for both small and large training sets. The function supports user-defined distance metrics and distance-weighted voting.
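As a rough local illustration of the idea described above, the following base R sketch classifies one test object by majority vote among its k nearest training objects, using Euclidean distance. The helper `knn_classify_one` and the toy data are assumptions made for this example; they are not part of tdplyr and do not reflect how the function is implemented in Vantage.

```r
# Classify one test row by majority vote among its k nearest training rows.
# `knn_classify_one` is a hypothetical local helper, not a tdplyr function.
knn_classify_one <- function(train_x, train_y, test_row, k) {
  # Euclidean distance from the test row to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(test_row))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  # Most frequent label wins (ties go to the first maximum)
  names(votes)[which.max(votes)]
}

# Made-up training data: two features and a class label per row
train_x <- data.frame(speed = c(25, 33, 66, 100), ram = c(4, 4, 16, 32))
train_y <- c("basic", "basic", "premium", "premium")

knn_classify_one(train_x, train_y, c(speed = 90, ram = 24), k = 3)  # "premium"
```

In this toy case the three nearest rows carry the labels premium, premium, and basic, so the majority label is returned.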


  td_knn_mle (
      train = NULL,
      test = NULL,
      k = NULL,
      response.column = NULL,
      id.column = NULL,
      distance.features = NULL,
      voting.weight = 0,
      customized.distance = NULL,
      force.mapreduce = FALSE,
      parblock.size = NULL,
      partition.key = NULL,
      accumulate = NULL,
      output.prob = FALSE,
      train.sequence.column = NULL,
      test.sequence.column = NULL)



Required Argument.
Specifies the name of the tbl_teradata that contains the training data. Each row represents a classified data object.


Required Argument.
Specifies the name of the tbl_teradata that contains the test data to be classified by the td_knn_mle algorithm. Each row represents a test data object.


Required Argument.
Specifies the number of nearest neighbors to use for classifying the test data.
Types: numeric


Required Argument.
Specifies the name of the training column that contains the class label or classification of the classified data objects.
Types: character


Required Argument.
Specifies the name of the testing column that uniquely identifies a data object.
Types: character


Required Argument.
Specifies the names of the training tbl_teradata columns that the function uses to compute the distance between a test object and the training objects. The test tbl_teradata must also have these columns.
Types: character OR vector of Strings (character)


Optional Argument.
Specifies the voting weight of the distance between a test object and the training objects. The value of voting.weight must be a non-negative integer. The function calculates the distance-weighted vote, w, with this equation:
w = 1/POWER(distance, voting.weight)
where distance is the distance between the test object and the training object.
Default Value: 0
Types: numeric
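The weighting equation above can be tried out locally. The sketch below is a minimal base R illustration, assuming made-up distances and labels; `vote_weight` is a hypothetical helper and not a tdplyr function. It does not show how Vantage treats a distance of zero, for which this formula yields Inf whenever voting.weight is greater than 0.

```r
# Weight of one neighbour's vote: w = 1 / distance^voting.weight.
# `vote_weight` is a hypothetical helper used only for illustration.
vote_weight <- function(distance, voting.weight = 0) {
  1 / distance^voting.weight
}

# Three neighbours at made-up distances with made-up labels
distances <- c(1, 2, 4)
labels <- c("A", "B", "B")

vote_weight(distances, voting.weight = 0)  # 1 1 1 (every vote counts equally)
vote_weight(distances, voting.weight = 1)  # 1 0.5 0.25

# Sum the weights per label; the label with the largest total wins
totals <- tapply(vote_weight(distances, voting.weight = 1), labels, sum)
totals  # A = 1.00, B = 0.75
```

With voting.weight = 0 the two "B" neighbours outvote the single "A" neighbour. With voting.weight = 1 the closer "A" neighbour wins, because its weight (1) exceeds the combined weight of the two "B" neighbours (0.75).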


Optional Argument.
Specifies the distance function as two values. The first value is the name of the JAR file that contains the distance metric class. The second value is the name of the distance metric class defined in that JAR file. For details on how to install a JAR file, refer to the Teradata Vantage user guide. The default distance function is Euclidean distance.


Optional Argument.
Specifies whether to partition the training data. If you specify FALSE (the default), the td_knn_mle function loads all training data into memory and uses only the row function. If you specify TRUE, the td_knn_mle function partitions the training data and uses the map and reduce functions.
Default Value: FALSE
Types: logical


Optional Argument.
Specifies the partition block size to use when force.mapreduce is TRUE. The recommended value depends on the size of the training data and the number of vworkers. For example, if your training data size is 10 billion and you have 10 vworkers, the recommended parblock.size is 1/n billion, where n is an integer that corresponds to the memory of your vworker nodes. Omitting this argument or specifying an inappropriate value can degrade performance.
Types: numeric


Optional Argument.
Specifies the name of the training tbl_teradata column used to partition the data in the parallel model. The default value is the first column of the "distance.features" argument.
Types: character


Optional Argument.
Specifies the names of test tbl_teradata columns to copy to the output table.
Types: character OR vector of Strings (character)
Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.


Optional Argument.
Specifies whether to display output probability for the predicted category.
Default Value: FALSE
Types: logical
Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.


Optional Argument.
Specifies the vector of column(s) that uniquely identify each row of the input argument "train". The argument is used to ensure deterministic results for functions that produce results that vary from run to run.
Types: character OR vector of Strings (character)


Optional Argument.
Specifies the vector of column(s) that uniquely identify each row of the input argument "test". The argument is used to ensure deterministic results for functions that produce results that vary from run to run.
Types: character OR vector of Strings (character)


The function returns an object of class "td_knn_mle", which is a named list containing Teradata tbl objects.
Members of the named list can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. output
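The member access described above can be sketched without a Vantage connection by using a local stand-in for the returned object. In this sketch, `mock_out` and its column names are placeholders invented for illustration: a real result holds remote tbl objects, and the actual columns of each member are not shown here.

```r
# Stand-in for a real result: local data frames replace the remote tbl objects.
# The column names and values below are placeholders, not the real output schema.
mock_out <- structure(
  list(
    output.table = data.frame(id = 1:2, prediction = c("x", "y")),
    output = data.frame(message = "placeholder")
  ),
  class = "td_knn_mle"
)

names(mock_out)         # "output.table" "output"
mock_out$output.table   # member access with the "$" operator
mock_out[["output"]]    # equivalent access with "[["
```

With a real result, the same `$` access applies, for example `td_knn_out$output.table` for an object named `td_knn_out`.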


    # Get the current context/connection
    con <- td_get_context()$connection
    # Load example data.
    loadExampleData("knn_example", "computers_train1_clustered", "computers_test1")
    # The "computers_train1_clustered" and "computers_test1" remote tibbles
    # contain five attributes of personal computers: price, speed, hard disk
    # size, RAM, and screen size.
    computers_train1_clustered <- tbl(con, "computers_train1_clustered")
    computers_test1 <- tbl(con, "computers_test1")
    # Example 1 - Map the test computer data to their respective categories
    td_knn_out <- td_knn_mle(train = computers_train1_clustered,
                         test = computers_test1,
                         k = 50,
                         response.column = "computer_category",
                         id.column = "id",
                         distance.features = c("price","speed","hd","ram","screen"),
                         voting.weight = 1)