1.1 - 8.10 - KNN Syntax Elements - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
1.1
8.10
Release Date
October 2019
Content Type
Programming Reference
Publication ID
B700-4003-079K
Language
English (United States)
OutputTable
[Optional] Specify the name of the output table.
Default behavior: The function displays the output to the screen.
K
Specify the number of nearest neighbors to use for classifying the test data. The choice of k presents a bias-variance trade-off. A higher value of k typically results in lower variance and smoother decision boundaries but increases bias; a lower value has the opposite effect. If more than k neighbors are at the same distance, the function chooses k of them at random. This adds nondeterminism to the algorithm and may cause classification results to vary between runs. To ensure deterministic behavior, use the UniqueID syntax element.
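The effect of k and of ties at equal distance can be illustrated with a minimal Python sketch. It is not the ML Engine implementation: the function name knn_predict, the (row_id, features, label) tuple layout, and the choice to break ties by the smaller row identifier are all assumptions made for illustration.

```python
from collections import Counter
import math

def knn_predict(train, test_point, k):
    """Classify test_point by majority vote among its k nearest training rows.

    train is a list of (row_id, features, label) tuples (hypothetical layout).
    """
    # Sort by distance first, then by row_id, so that neighbors at exactly
    # the same distance are ranked the same way on every run.
    ranked = sorted(train, key=lambda r: (math.dist(r[1], test_point), r[0]))
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Three training rows are equally far from (1, 1); the row_id decides the order.
train = [
    (1, (0, 0), "a"),
    (2, (0, 2), "b"),
    (3, (2, 0), "b"),
    (4, (5, 5), "b"),
]
print(knn_predict(train, (1, 1), 1))
print(knn_predict(train, (1, 1), 3))
```

With k = 1 only the tied row with the smallest identifier votes; with k = 3 all three tied rows vote, so the predicted class can change with k even on the same data.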
ResponseColumn
Specify the name of the TrainingData column that contains the class label (classification) of each training data object.
IDColumn
Specify the name of the TestData column that uniquely identifies a data object.
DistanceFeatures
Specify the names of the TrainingData columns that the function uses to compute the distance between a test object and the training objects. The TestData table must also have these columns.
A null value in a column is treated as infinite distance.
When computing nearest neighbors, the function considers only objects from the training data, not test data objects that it has already classified.
If different features have different units of measurement, Teradata recommends normalizing all data points to be in the range [0,1].
As the number of DistanceFeatures increases, the distances between data points become increasingly similar to one another, and the usefulness of the distance measure decreases. If necessary, reduce the number of features used for distance computation by applying feature selection or dimensionality reduction methods.
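The normalization recommendation and the null-value rule above can be sketched in Python as follows. This is an illustrative sketch only: min_max_scale and euclidean are hypothetical helper names, the scaling helper assumes no null values in its input, and the real function's internal handling may differ.

```python
import math

def min_max_scale(rows):
    """Scale every feature column of rows (a list of tuples) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append(tuple(
            # A constant column carries no distance information; map it to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ))
    return scaled

def euclidean(a, b):
    """Euclidean distance; a null (None) feature yields infinite distance."""
    if any(x is None for x in a) or any(x is None for x in b):
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Columns on very different scales end up in the same [0, 1] range.
print(min_max_scale([(0, 100), (5, 300), (10, 200)]))
print(euclidean((None, 1), (0, 0)))
```

Without scaling, the second column in this example would dominate the distance simply because its values are larger.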
VotingWeight
[Optional] Specify the voting weight of the distance between a test object and the training objects. The voting_weight must be a nonnegative integer.
The function calculates distance-weighted voting, w, with this equation:

w = 1/POWER(distance, voting_weight)

Where distance is the distance between the test object and the training object.
Default: 0
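The weighting equation above can be illustrated with a short Python sketch. The function name weighted_vote and the (distance, label) input layout are assumptions, and treating a zero distance as infinite weight is an illustrative choice that the source does not specify.

```python
from collections import defaultdict

def weighted_vote(neighbors, voting_weight=0):
    """Return the label with the largest total weight.

    neighbors is a list of (distance, label) pairs for the k nearest
    training objects (hypothetical layout).
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        if voting_weight == 0:
            w = 1.0  # default: every neighbor counts equally
        elif distance == 0:
            w = float("inf")  # assumption: an exact match outweighs all others
        else:
            w = 1.0 / distance ** voting_weight  # w = 1/POWER(distance, voting_weight)
        totals[label] += w
    return max(totals, key=totals.get)

neighbors = [(1.0, "a"), (2.0, "b"), (4.0, "b")]
print(weighted_vote(neighbors, 0))
print(weighted_vote(neighbors, 1))
```

With the default weight of 0 the two "b" neighbors win by count; with a weight of 1 the single closer "a" neighbor (weight 1.0) outweighs them (0.5 + 0.25).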
CustomizedDistance
[Optional] Specify the distance function. The parameter jar is the name of the JAR file that contains the distance metric class. The parameter distance_class is the distance metric class defined in the jar file. This JAR file must be installed on ML Engine.
ML Engine does not support the creation of new customized distance classes. However, it does support existing JAR files. For installation instructions, see Teradata Vantage™ User Guide, B700-4002.
Default: Euclidean distance
ForceMapReduce
[Optional] Specify whether to partition the training data. If you specify 'true', the KNN function partitions the training data and uses the map and reduce functions.
If you specify ForceMapReduce, you must also specify PartitionColumn.
Default: 'false' (The function loads all training data into memory and uses only the row function.)
PartitionColumn
[Required if ForceMapReduce is 'true', ignored otherwise.] Specify the name of the column by which the input table can be uniformly partitioned. The partition_column must contain INTEGER, BIGINT, or BYTEINT values. A unique identifier for each row results in more uniform partitioning and therefore better performance.
PartitionBlockSize
[Optional] Specify the partition block size to use with ForceMapReduce ('true'). Specifying an optimal value for this syntax element may improve performance. The optimal value depends on the size of the training data and the vworker configuration. Because rows in a partition are processed together, a higher value improves performance, but the maximum value is limited by the memory of the vworker. For example, if the training data set has 1024 rows, specifying PartitionBlockSize('16') divides the input data into 64 (1024/16) partitions of 16 rows each. Similarly, PartitionBlockSize('128') creates 8 (1024/128) partitions of 128 rows each. The partitions are distributed evenly across the available vworkers.
If you omit this syntax element, the function calculates the partition block size that best fits available memory. The calculated value is optimal if partition_column has a unique INTEGER for each row.
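The partition arithmetic in the PartitionBlockSize example can be checked with a small Python sketch. The helper name partition_count is hypothetical, and rounding up when the row count is not an exact multiple of the block size is an assumption; the source only covers the evenly divisible case.

```python
import math

def partition_count(num_rows, block_size):
    """Number of partitions needed to hold num_rows rows in blocks of block_size.

    Rounding up for a partial last block is an assumption, not documented behavior.
    """
    return math.ceil(num_rows / block_size)

print(partition_count(1024, 16))   # 1024/16
print(partition_count(1024, 128))  # 1024/128
```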
OutputProb
Specify whether to output the calculated probability for each observation.
Default: 'false'
Accumulate
Specify the names of the TrainingTable columns to copy to OutputTable.