KNN
Description
The KNN function uses training data objects to map test data objects
to categories. The function is optimized for both small and large
training sets. The function supports user-defined distance metrics
and distance-weighted voting.
Usage
td_knn_mle (
  train = NULL,
  test = NULL,
  k = NULL,
  response.column = NULL,
  id.column = NULL,
  distance.features = NULL,
  voting.weight = 0,
  customized.distance = NULL,
  force.mapreduce = FALSE,
  parblock.size = NULL,
  partition.key = NULL,
  accumulate = NULL,
  output.prob = FALSE,
  train.sequence.column = NULL,
  test.sequence.column = NULL
)
Arguments
train |
Required Argument.
Specifies the name of the tbl_teradata that contains the training
data. Each row represents a classified data object.
|
test |
Required Argument.
Specifies the name of the tbl_teradata that contains the test data to
be classified by the td_knn_mle function. Each row represents a test
data object.
|
k |
Required Argument.
Specifies the number of nearest neighbors to use for classifying the
test data.
Types: integer
|
response.column |
Required Argument.
Specifies the name of the training tbl_teradata column that contains
the class label or classification of the classified data objects.
Types: character
|
id.column |
Required Argument.
Specifies the name of the testing tbl_teradata column that uniquely
identifies a data object.
Types: character
|
distance.features |
Required Argument.
Specifies the names of the training tbl_teradata columns that the
function uses to compute the distance between a test object and the
training objects. The test tbl_teradata must also have these
columns.
Types: character OR vector of Strings (character)
|
voting.weight |
Optional Argument.
Specifies the voting weight of the distance between a test object and
the training objects. The value of "voting.weight" must be a non-negative
integer. The function calculates the distance-weighted vote, w, with this
equation: w = 1/POWER(distance, voting.weight), where distance is the
distance between the test object and the training object. With the
default value of 0, every neighbor's vote has weight 1.
Default Value: 0
Types: numeric
|
customized.distance |
Optional Argument.
Specifies the distance function. The first value of the argument is the
name of the JAR file that contains the distance metric class. The second
value is the name of the distance metric class defined in the JAR file.
For details on how to install a JAR file, see the Teradata Vantage user
guide.
The default distance function is Euclidean distance.
Types: character OR vector of characters
|
force.mapreduce |
Optional Argument.
Specifies whether to partition the training data. If you specify FALSE
(the default), the td_knn_mle function loads all training data into
memory and uses only the row function. If you specify TRUE, the
td_knn_mle function partitions the training data and uses the map and
reduce functions.
Default Value: FALSE
Types: logical
|
parblock.size |
Optional Argument.
Specifies the partition block size to use when "force.mapreduce" is
TRUE. The recommended value depends on the training data size and the
number of vworkers. For example, if your training data size is 10
billion and you have 10 vworkers, the recommended "parblock.size" is
1/n billion, where n is an integer that corresponds to the memory of
your vworker nodes. Omitting this argument or specifying an
inappropriate value for it can degrade performance.
Types: integer
|
partition.key |
Optional Argument.
Specifies the name of the training tbl_teradata column that partitions
data in parallel model. The default value is the first column specified
in the "distance.features" argument.
Types: character
|
accumulate |
Optional Argument.
Specifies the names of test tbl_teradata columns to copy to the
output tbl_teradata.
Note: This argument is supported when tdplyr is connected to Vantage 1.1
or later versions.
Types: character OR vector of Strings (character)
|
output.prob |
Optional Argument.
Specifies whether to display output probability for the predicted
category.
Note: This argument is supported when tdplyr is connected to Vantage 1.1
or later versions.
Default Value: FALSE
Types: logical
|
train.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "train". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
test.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "test". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
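Note: The distance-weighted voting described for the "voting.weight"
argument can be illustrated with a small, self-contained R sketch. It is
not the td_knn_mle implementation and does not connect to Vantage. The
helper name knn_vote, its parameters, and the sample data are
hypothetical, and the sketch assumes the default Euclidean distance.

```r
# Hypothetical helper (not part of tdplyr): classify one test point with
# distance-weighted k-nearest-neighbor voting.
knn_vote <- function(train_x, train_y, test_x, k, voting_weight = 0) {
  # Euclidean distance from the test point to every training row.
  d <- sqrt(rowSums(sweep(train_x, 2, test_x)^2))
  # Indices of the k closest training rows.
  nearest <- order(d)[seq_len(k)]
  # w = 1 / distance^voting_weight; with voting_weight = 0 every vote is 1.
  w <- 1 / d[nearest]^voting_weight
  # Sum the weights per class label and return the label with the largest total.
  totals <- tapply(w, train_y[nearest], sum)
  names(totals)[which.max(totals)]
}

# Made-up data: one "A" row at distance 1, two "B" rows at distances 3 and 4.
train_x <- matrix(c(1, 0,
                    3, 0,
                    0, 4), ncol = 2, byrow = TRUE)
train_y <- c("A", "B", "B")
test_x  <- c(0, 0)

unweighted <- knn_vote(train_x, train_y, test_x, k = 3, voting_weight = 0)
weighted   <- knn_vote(train_x, train_y, test_x, k = 3, voting_weight = 1)
print(unweighted)  # "B": two votes of weight 1 beat one vote of weight 1
print(weighted)    # "A": weight 1 beats 1/3 + 1/4
```

With voting_weight = 0 every neighbor counts equally, so the two more
distant "B" rows outvote the single "A" row. With voting_weight = 1 the
closer "A" row carries more weight than both "B" rows combined.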
Value
Function returns an object of class "td_knn_mle" which is a named
list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using the following names:
output.table
output
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("knn_example", "computers_train1_clustered", "computers_test1")
# Both the "computers_train1_clustered" tbl_teradata and the "computers_test1"
# tbl_teradata contain five attributes of personal computers: price, speed,
# hard disk size, RAM, and screen size.
computers_train1_clustered <- tbl(con, "computers_train1_clustered")
computers_test1 <- tbl(con, "computers_test1")
# Example 1: Map the test computer data to their respective categories.
td_knn_out <- td_knn_mle(train = computers_train1_clustered,
                         test = computers_test1,
                         k = 50,
                         response.column = "computer_category",
                         id.column = "id",
                         distance.features = c("price", "speed", "hd", "ram", "screen"),
                         voting.weight = 1
                         )
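# Inspect the result of Example 1. This is a minimal sketch: it relies only on
# the member names listed in the Value section and assumes that the call above
# completed successfully against a live Vantage connection.
td_knn_out$output.table
td_knn_out$output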