KNN Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
9.02
9.01
2.0
1.3
Published
February 2022
Language
English (United States)
Last Update
2022-02-10
dita:id
B700-4003
OutputTable
[Optional] Specify the name of the output table.
Default behavior: The function displays the output to the screen.
K
Specify the number of nearest neighbors to use for classifying the test data. The choice of k presents a bias-variance trade-off: a higher value of k typically lowers variance and smooths decision boundaries but increases bias, and a lower value does the reverse. If more than k training objects qualify because several are tied at the same distance, the function chooses among the tied objects at random. This makes the algorithm nondeterministic, so classification results may vary between runs. To ensure deterministic behavior, use the UniqueID syntax element.
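The effect of ties on repeatability can be illustrated with a minimal Python sketch. This is not ML Engine code: the function name k_nearest and the (training_id, distance) tuple layout are assumptions made for illustration only. Without a unique ID, tied neighbors are picked in random order; with one, the ID acts as a secondary sort key and the selection is repeatable.

```python
import random

def k_nearest(neighbors, k, use_unique_id=False, rng=None):
    """Return the k nearest (training_id, distance) pairs.

    neighbors: list of (training_id, distance) tuples (hypothetical layout).
    use_unique_id: if True, ties on distance are broken by training_id,
    so the same input always yields the same result.
    """
    if use_unique_id:
        # Sort by distance first, then by ID: fully deterministic.
        return sorted(neighbors, key=lambda n: (n[1], n[0]))[:k]
    # Shuffle first; the stable sort then keeps tied neighbors in random order.
    rng = rng or random.Random()
    shuffled = list(neighbors)
    rng.shuffle(shuffled)
    return sorted(shuffled, key=lambda n: n[1])[:k]
```

With three training objects tied at distance 1.0 and k set to 2, the deterministic variant always returns the tied object with the smallest ID, while the random variant may return any of the three.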
ResponseColumn
Specify the name of the TrainingData column that contains the class label of each training data object.
TrainingIDColumn
[Required with UniqueID, ignored otherwise.] You need this argument only to obtain deterministic (repeatable) results. Specify the name of the TrainingData column that uniquely identifies a data object. For information about UniqueID, see Nondeterministic Results and UniqueID Syntax Element.
TestIDColumn
Specify the name of the TestData column that uniquely identifies a data object.
TargetColumns
Specify the names of the TrainingData columns that the function uses to compute the distance between a test object and the training objects. The TestData table must also have these columns.
A null value in a column is treated as infinite distance.
While computing nearest neighbors, the function considers only neighbors from training data, not the already predicted neighbors from testing data.
If different features have different units of measurement, Teradata recommends normalizing all data points to be in the range [0,1].
As the number of target columns increases, the distances between data points become increasingly similar and the usefulness of the distance measure decreases. If necessary, reduce the number of features used for distance computation by applying feature selection or dimensionality reduction methods.
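The normalization recommendation and the null-handling rule above can be sketched in Python as follows. The helper names min_max_scale and euclidean are hypothetical, and the sketch assumes that a null in either the test row or the training row makes the distance infinite; the source states only that a null value is treated as infinite distance.

```python
import math

def min_max_scale(values):
    """Rescale a column to the range [0, 1]; None (null) values stay None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    # A constant column has no spread, so map every value to 0.0.
    return [None if v is None else (0.0 if span == 0 else (v - lo) / span)
            for v in values]

def euclidean(test_row, train_row):
    """Euclidean distance over the target columns.

    Assumption: a null in either row yields an infinite distance.
    """
    if any(a is None or b is None for a, b in zip(test_row, train_row)):
        return math.inf
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(test_row, train_row)))
```

In practice the minimum and maximum would be taken from the training data and reused for the test data, so that both tables are on the same scale.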
VotingWeight
[Optional] Specify voting_weight, the exponent applied to the distance between a test object and a training object when the function weights votes. The voting_weight must be a nonnegative integer.
The function calculates the distance-based voting weight, w, with this equation:

w = 1/POWER(distance, voting_weight)

Where distance is the distance between the test object and the training object.
Default: 0 (w = 1 for every neighbor, so all votes count equally)
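A minimal Python sketch of the weighting equation and the resulting vote follows. The function names vote_weight and predict and the (label, distance) tuple layout are assumptions for illustration. The source does not say how a distance of zero is handled when voting_weight is positive; this sketch gives such a neighbor infinite weight.

```python
from collections import defaultdict

def vote_weight(distance, voting_weight=0):
    """w = 1 / POWER(distance, voting_weight)."""
    if distance == 0 and voting_weight > 0:
        # Assumption: zero-distance handling is not documented in the source;
        # here an exact match receives infinite weight.
        return float("inf")
    return 1.0 / (distance ** voting_weight)

def predict(neighbors, voting_weight=0):
    """Return the label with the largest total weight.

    neighbors: list of (label, distance) pairs for the k nearest
    training objects (hypothetical layout).
    """
    totals = defaultdict(float)
    for label, distance in neighbors:
        totals[label] += vote_weight(distance, voting_weight)
    return max(totals, key=totals.get)
```

For neighbors [("a", 1.0), ("b", 4.0), ("b", 4.0)], a voting_weight of 0 gives "b" two votes against one, while a voting_weight of 1 gives "a" a total of 1.0 against 0.5 for "b", so the closer neighbor wins.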
CustomizedDistance
[Optional] Specify the distance function. The parameter jar is the name of the JAR file that contains the distance metric class. The parameter distance_class is the distance metric class defined in the jar file. This JAR file must be installed on ML Engine.
ML Engine does not support the creation of new customized distance classes. However, it does support existing JAR files. For installation instructions, see Teradata Vantage™ User Guide, B700-4002.
Default: Euclidean distance
ForceMapReduce
[Optional] Specify whether to partition the training and test data and use the map-and-reduce function.
Specify 'true' if the training data is too large to fit in the combined memory of all vworkers.
Default: 'false' (The function loads all training data into vworkers' memory.)
TrainBlockSize
[Optional with ForceMapReduce ('true'), ignored otherwise.] Specify the partition block size of training data to use with ForceMapReduce ('true'). Specifying an optimal value for this syntax element may improve performance. The optimal value depends on the size of the training data and the vworker configuration. Because rows in a partition are processed together, a higher value improves performance, but the maximum value is limited by the memory of each vworker.
For example, if the training data set has 1024 rows, TrainBlockSize('16') divides the input data into 64 (1024/16) partitions of 16 rows each. Similarly, TrainBlockSize('128') creates 8 (1024/128) partitions of 128 rows each. The partitions are distributed evenly across the available vworkers.
An optimal value is typically in the range [1000, 10000].
Default behavior: The function calculates the training block size that best fits available memory.
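The partition arithmetic in the TrainBlockSize example can be checked with a short Python sketch. The function name partition_count is hypothetical, and rounding up for a final partial partition is an assumption, because the source covers only row counts that divide evenly.

```python
import math

def partition_count(training_rows, train_block_size):
    """Number of partitions for a given block size.

    Assumption: a remainder of rows forms one extra, smaller partition.
    """
    return math.ceil(training_rows / train_block_size)
```

With 1024 training rows, a block size of 16 yields 64 partitions and a block size of 128 yields 8, which matches the example above.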
TestBlockSize
[Optional with ForceMapReduce ('true'), ignored otherwise.] Specify the partition block size of test data to use with ForceMapReduce ('true'). As with TrainBlockSize, specifying an optimal value for this syntax element may improve performance. TestBlockSize affects performance more than TrainBlockSize does, because the function keeps k values in memory for each test data object.
Default behavior: The function calculates the test block size that best fits available memory.
OutputProb
[Required to be 'true' with Responses, optional otherwise.] Specify whether to output the calculated probability for each test data object.
Default: 'false'
Responses
[Optional with OutputProb ('true'), disallowed otherwise.] Specify responses for which to output probability.
If you specify OutputProb ('true') and omit Responses, the function adds the column prob to the output table.
If you specify OutputProb ('true') and specify n responses, the function adds n columns to the output table.
Default behavior: Output only the probability of the predicted class.
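Several syntax elements in this section depend on one another. The following Python sketch collects those rules into a single check. It is an illustration only: the function check_knn_arguments, the dictionary layout, and the assumption that UniqueID takes 'true' or 'false' are hypothetical and not part of ML Engine.

```python
def is_true(value):
    """Interpret a syntax element value such as 'true' or 'false'."""
    return str(value).strip().lower() == "true"

def check_knn_arguments(args):
    """Return a list of problems found in a dict of syntax elements."""
    problems = []
    # Responses is allowed only together with OutputProb('true').
    if "Responses" in args and not is_true(args.get("OutputProb", "false")):
        problems.append("Responses requires OutputProb('true')")
    # Assumption: UniqueID is a 'true'/'false' flag.
    if is_true(args.get("UniqueID", "false")) and "TrainingIDColumn" not in args:
        problems.append("TrainingIDColumn is required with UniqueID")
    weight = args.get("VotingWeight", 0)
    if isinstance(weight, bool) or not isinstance(weight, int) or weight < 0:
        problems.append("VotingWeight must be a nonnegative integer")
    # Block sizes take effect only with ForceMapReduce('true').
    if not is_true(args.get("ForceMapReduce", "false")):
        for name in ("TrainBlockSize", "TestBlockSize"):
            if name in args:
                problems.append(name + " is ignored without ForceMapReduce('true')")
    return problems
```

For example, passing {"Responses": ["yes"]} without OutputProb reports one problem, while adding "OutputProb": "true" to the same dictionary reports none.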
Accumulate
Specify the names of the TestData columns to copy to OutputTable.