Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
dita:id: B700-4007
NMT: no
Product Category: Teradata Vantage

KNN

Description

The KNN function uses training data objects to map test data objects to categories. The function is optimized for both small and large training sets. The function supports user-defined distance metrics and distance-weighted voting.

Usage

  td_knn_mle (
      train = NULL,
      test = NULL,
      k = NULL,
      response.column = NULL,
      id.column = NULL,
      distance.features = NULL,
      voting.weight = 0,
      customized.distance = NULL,
      force.mapreduce = FALSE,
      parblock.size = NULL,
      partition.key = NULL,
      accumulate = NULL,
      output.prob = FALSE,
      train.sequence.column = NULL,
      test.sequence.column = NULL
  )

Arguments

train

Required Argument.
Specifies the name of the tbl_teradata that contains the training data. Each row represents a classified data object.

test

Required Argument.
Specifies the name of the tbl_teradata that contains the test data to be classified by the td_knn_mle function. Each row represents a test data object.

k

Required Argument.
Specifies the number of nearest neighbors to use for classifying the test data.
Types: integer

response.column

Required Argument.
Specifies the name of the training tbl_teradata column that contains the class label or classification of the classified data objects.
Types: character

id.column

Required Argument.
Specifies the name of the testing tbl_teradata column that uniquely identifies a data object.
Types: character

distance.features

Required Argument.
Specifies the names of the training tbl_teradata columns that the function uses to compute the distance between a test object and the training objects. The test tbl_teradata must also have these columns.
Types: character OR vector of Strings (character)

voting.weight

Optional Argument.
Specifies the exponent that determines how strongly the distance between a test object and a training object affects that training object's vote. The value of "voting.weight" must be a non-negative integer. The function calculates the distance-weighted vote, w, with this equation: w = 1/POWER(distance, voting.weight), where distance is the distance between the test object and the training object. With the default value of 0, every neighbor has a weight of 1.
Default Value: 0
Types: numeric
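
To make the weighting formula concrete, the following base R sketch computes w = 1/distance^voting.weight for a few made-up neighbor distances and tallies the weighted votes per class. The function name weighted_vote, the distances, and the labels are hypothetical illustrations. This is not the Vantage implementation, and the sketch assumes all distances are greater than zero.

```r
# Hypothetical helper: tally distance-weighted votes for one test object.
# distances:     numeric distances to the k nearest training objects (all > 0)
# labels:        class labels of those training objects
# voting.weight: non-negative exponent, as in w = 1 / distance^voting.weight
weighted_vote <- function(distances, labels, voting.weight = 0) {
  stopifnot(length(distances) == length(labels),
            all(distances > 0), voting.weight >= 0)
  w <- 1 / distances^voting.weight
  votes <- tapply(w, labels, sum)   # sum of weights per class
  names(votes)[which.max(votes)]    # class with the largest total weight
}

# Made-up neighbors: one close "A" object and two distant "B" objects.
d   <- c(1, 4, 4)
lab <- c("A", "B", "B")

weighted_vote(d, lab, voting.weight = 0)  # "B": every neighbor counts equally
weighted_vote(d, lab, voting.weight = 1)  # "A": the closer neighbor outweighs the two distant ones
```

With a weight of 0 the vote is a plain majority, so the two "B" neighbors win. With a weight of 1 the "A" neighbor contributes 1 while the "B" neighbors contribute 0.25 each, so "A" wins.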

customized.distance

Optional Argument.
Specifies the distance function. The first value of the argument is the name of the JAR file that contains the distance metric class. The second value is the name of the distance metric class defined in that JAR file. For details on how to install a JAR file, see the Teradata Vantage user guide. If you omit this argument, the function uses Euclidean distance.
Types: character OR vector of characters
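
For reference, the default Euclidean metric can be sketched in base R as follows. The function name euclidean_distance and the feature values are hypothetical. The sketch only illustrates the metric itself and does not show how to write or install a custom distance class.

```r
# Hypothetical helper: Euclidean distance from one test object to every
# row of a training matrix, using the same feature columns in both.
euclidean_distance <- function(train, test_row) {
  stopifnot(ncol(train) == length(test_row))
  diffs <- sweep(train, 2, test_row)   # subtract test_row from each training row
  sqrt(rowSums(diffs^2))
}

# Made-up feature values for two training objects and one test object.
train <- matrix(c(3, 4,
                  6, 8),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("speed", "ram")))
test_row <- c(speed = 0, ram = 0)

euclidean_distance(train, test_row)  # 5 and 10
```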

force.mapreduce

Optional Argument.
Specifies whether to partition the training data. With the default value, FALSE, the td_knn_mle function loads all training data into memory and uses only the row function. If you specify TRUE, the td_knn_mle function partitions the training data and uses the map and reduce functions.
Default Value: FALSE
Types: logical

parblock.size

Optional Argument.
Specifies the partition block size to use when "force.mapreduce" is TRUE. The recommended value depends on the training data size and the number of vworkers. For example, if your training data size is 10 billion and you have 10 vworkers, the recommended "parblock.size" is 1/n billion, where n is an integer that corresponds to the memory of your vworker nodes. Omitting this argument or specifying an inappropriate value can degrade performance.
Types: integer

partition.key

Optional Argument.
Specifies the name of the training tbl_teradata column that the function uses to partition the data in parallel mode. The default value is the first column of the "distance.features" argument.
Types: character

accumulate

Optional Argument.
Specifies the names of test tbl_teradata columns to copy to the output tbl_teradata.
Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.
Types: character OR vector of Strings (character)

output.prob

Optional Argument.
Specifies whether to display output probability for the predicted category.
Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.
Default Value: FALSE
Types: logical

train.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "train". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

test.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "test". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_knn_mle", which is a named list containing objects of class "tbl_teradata".
You can reference the named list members directly with the "$" operator, using the following names:

  1. output.table

  2. output

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("knn_example", "computers_train1_clustered", "computers_test1")

    # Both the "computers_train1_clustered" tbl_teradata and the "computers_test1"
    # tbl_teradata contain five attributes of personal computers: price, speed,
    # hard disk size, RAM, and screen size.
    computers_train1_clustered <- tbl(con, "computers_train1_clustered")
    computers_test1 <- tbl(con, "computers_test1")

    # Example 1: Map the test computer data to their respective categories.
    td_knn_out <- td_knn_mle(train = computers_train1_clustered,
                             test = computers_test1,
                             k = 50,
                             response.column = "computer_category",
                             id.column = "id",
                             distance.features = c("price","speed","hd","ram","screen"),
                             voting.weight = 1
                             )