TD_KNN Function

K-nearest neighbors (k-NN) is a supervised learning technique that predicts outcomes for test data by finding the nearest neighbors in the training data under a similarity (distance) metric. The algorithm does not construct a model from the training set; instead, it predicts each test observation directly from its similarity to the training data.

k-NN uses a distance metric, such as Euclidean or Manhattan distance, to determine the similarity between data points. During the prediction phase, the algorithm calculates the distance between the new data point and every training example, selects the K closest neighbors, and makes a prediction from the majority class or average value of those neighbors. k-NN is simple and easy to implement, but it can be computationally expensive and sensitive to irrelevant features.
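As a generic illustration of this prediction step (plain Python, not the TD_KNN calling syntax; the data and function names are hypothetical), the following sketch computes one distance per training row, keeps the K closest rows, and takes a majority vote:

    import math
    from collections import Counter

    def euclidean(a, b):
        # Straight-line distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def manhattan(a, b):
        # Sum of absolute per-feature differences.
        return sum(abs(x - y) for x, y in zip(a, b))

    def knn_classify(train, labels, query, k, dist=euclidean):
        # Brute force: one distance per training row, then a majority
        # vote among the k closest rows.
        ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
        votes = Counter(labels[i] for i in ranked[:k])
        return votes.most_common(1)[0][0]

    # Toy data: two well-separated 2-D clusters.
    train  = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9)]
    labels = ["low", "low", "high", "high"]
    print(knn_classify(train, labels, (1.2, 0.8), k=3))   # prints: low

Passing manhattan as the dist argument swaps the similarity metric without changing the voting logic.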

The choice of K and the distance metric are important factors that affect the performance of k-NN; the sketch after this list demonstrates both effects.
  • If K is too small, the prediction becomes overly sensitive to outliers and noise in the training data.
  • If K is too large, the neighborhood can reach into unrelated regions of the data, so the algorithm may fail to capture the underlying patterns.
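A small, self-contained demonstration of both effects (toy 1-D data and hypothetical labels; again illustrative Python rather than TD_KNN usage):

    from collections import Counter

    # Four "low" points near 0, one mislabeled "high" outlier at 0.45,
    # and five genuine "high" points near 3.
    train  = [0.0, 0.1, 0.2, 0.3, 0.45, 3.0, 3.1, 3.2, 3.3, 3.4]
    labels = ["low"] * 4 + ["high"] * 6

    def predict(query, k):
        ranked = sorted(range(len(train)), key=lambda i: abs(train[i] - query))
        return Counter(labels[i] for i in ranked[:k]).most_common(1)[0][0]

    print(predict(0.4, k=1))   # high: the lone outlier decides
    print(predict(0.4, k=5))   # low:  the vote overrides the outlier
    print(predict(0.4, k=9))   # high: too-large K pulls in the far cluster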

TD_KNN supports the classification, regression, and neighbors model types. For classification, each test observation is assigned the category that wins a majority vote among its k nearest neighbors. For regression, each test observation is assigned a score based on the mean of its nearest neighbors' response values. For neighbors, the function returns the k nearest neighbors themselves.
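The model types differ only in the final step. A minimal sketch of the regression and neighbors variants under the same brute-force approach, with hypothetical IDs and response values (classification is the majority vote shown earlier):

    import math

    # Hypothetical training rows, each with an ID and a numeric response.
    train  = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
    ids    = ["r1", "r2", "r3", "r4"]
    scores = [10.0, 12.0, 50.0, 48.0]

    def nearest(query, k):
        # Indices of the k training rows closest to the query point.
        return sorted(range(len(train)),
                      key=lambda i: math.dist(train[i], query))[:k]

    nbrs = nearest((1.1, 0.9), k=2)
    print([ids[i] for i in nbrs])                    # neighbors: ['r1', 'r2']
    print(sum(scores[i] for i in nbrs) / len(nbrs))  # regression: 11.0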

The function supports up to 2018 features and 1000 labels for the classification model type.

The function internally calls the TD_VectorDistance function, whose computational complexity is O(N²), where N is the number of rows. The query can therefore run significantly longer as the number of rows grows in either the training table or the test table. In such cases, alternative algorithms such as decision trees, random forests, or neural networks may be more appropriate.
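To make the scaling concrete (hypothetical row counts):

    # Every test row is compared against every training row, so the
    # work grows with the product of the two row counts.
    for rows in (1_000, 10_000, 100_000):
        print(f"{rows:,} x {rows:,} rows -> {rows * rows:,} distance computations")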

Because the training table for TD_KNN is a DIMENSION input, it is copied to the spool of each AMP before processing. The size and scalability of this input are therefore limited by the user's spool space in the database.