- OutputTable
- [Optional] Specify the name of the output table.
- K
- Specify the number of nearest neighbors to use for classifying the test data. The choice of k presents a bias-variance trade-off: a higher value of k typically lowers variance and smooths the decision boundaries but increases bias, and a lower value does the reverse. If more than k neighbors are at the same distance, the function chooses k of them at random. This makes the algorithm nondeterministic and may cause classification results to vary between runs. To ensure deterministic behavior, use the UniqueID syntax element.
- ResponseColumn
- Specify the name of the TrainingData column that contains the class label (classification) of each training data object.
- TrainingIDColumn
- [Required with UniqueID, ignored otherwise.] You need this argument only to obtain deterministic (repeatable) results. Specify the name of the TrainingData column that uniquely identifies a data object. For information about UniqueID, see Nondeterministic Results and UniqueID Syntax Element.
- TestIDColumn
- Specify the name of the TestData column that uniquely identifies a data object.
- TargetColumns
- Specify the names of the TrainingData columns that the function uses to compute the distance between a test object and the training objects. The TestData table must also have these columns.
- VotingWeight
- [Optional] Specify the voting weight of the distance between a test object and the training objects. The voting_weight must be a nonnegative integer.
- CustomizedDistance
- [Optional] Specify the distance function. The parameter jar is the name of the JAR file that contains the distance metric class. The parameter distance_class is the distance metric class defined in that JAR file. The JAR file must be installed on ML Engine.
- ForceMapReduce
- [Optional] Specify whether to partition the training and test data and use the map-and-reduce function.
- TrainBlockSize
- [Optional with ForceMapReduce ('true'), ignored otherwise.] Specify the partition block size of the training data to use with ForceMapReduce ('true'). Specifying an optimal value for this syntax element may improve performance. The optimal value depends on the size of the training data and the vworker configuration. Because rows in a partition are processed together, a higher value improves performance, but the maximum value is limited by the memory of the vworker. For example, if the training data set has 1024 rows, specifying TrainBlockSize('16') splits the input data into 64 (1024/16) partitions of 16 rows each. Similarly, TrainBlockSize('128') creates 8 (1024/128) partitions of 128 rows each. The partitions are distributed evenly across the available vworkers. An optimal value is typically in the range [1000, 10000].
- TestBlockSize
- [Optional with ForceMapReduce ('true'), ignored otherwise.] As with TrainBlockSize, specifying an optimal value for this syntax element may improve performance. TestBlockSize impacts performance more significantly than TrainBlockSize, because the function keeps k values in memory for each test data object.
- OutputProb
- [Required to be 'true' with Responses, optional otherwise.] Specify whether to output the calculated probability for each test data object.
- Responses
- [Optional with OutputProb ('true'), disallowed otherwise.] Specify responses for which to output probability.
- Accumulate
- Specify the names of the TestData columns to copy to OutputTable.
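
The following Python sketch illustrates two ideas from the descriptions above: how a block size translates into a partition count (as in the TrainBlockSize example), and how a unique identifier can make tie-breaking among equidistant neighbors repeatable. It is not the function's implementation. The names `partition_count` and `knn_predict`, the use of Euclidean distance, rounding up when the rows do not divide evenly, and the rule of resolving vote ties by label order are all assumptions made for illustration.

```python
import math
from collections import Counter


def partition_count(num_rows, block_size):
    """Return how many partitions result from splitting num_rows rows
    into blocks of block_size rows (assumption: the last block may be smaller)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Integer ceiling division, so a partial final block still counts.
    return -(-num_rows // block_size)


def knn_predict(train, test_point, k):
    """Classify test_point by majority vote among its k nearest training rows.

    train is a list of (row_id, features, label) tuples.
    Rows at the same distance are ordered by row_id, so the same k
    neighbors are chosen on every run (hypothetical tie-breaking rule).
    """
    ranked = sorted(
        train,
        key=lambda row: (math.dist(row[1], test_point), row[0]),
    )
    votes = Counter(label for _, _, label in ranked[:k])
    top = max(votes.values())
    # Assumption: if several labels share the top vote count, take the smallest label.
    return min(label for label, count in votes.items() if count == top)


# The block-size arithmetic from the TrainBlockSize description.
print(partition_count(1024, 16))   # 64
print(partition_count(1024, 128))  # 8

# Four training rows, all at distance 1 from the origin.
train = [
    (1, (0.0, 1.0), "a"),
    (2, (1.0, 0.0), "b"),
    (3, (0.0, -1.0), "b"),
    (4, (-1.0, 0.0), "a"),
]
print(knn_predict(train, (0.0, 0.0), 1))  # a
print(knn_predict(train, (0.0, 0.0), 3))  # b
```

In the toy data, every training row is equally far from the test point, so choosing k neighbors at random could return either label for k = 1. Ordering ties by the identifier always selects the same rows, which is the kind of repeatability the ID columns are meant to provide.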