TD_KNN Function


K-nearest Neighbors (k-NN) is a supervised learning technique that predicts the test data by computing its nearest neighbors in the training data based on a similarity (distance) metric. The algorithm does not construct a model from the training set; instead, it predicts the test data directly based on its similarity to the training data.

KNN uses a distance metric, such as Euclidean or Manhattan distance, to determine the similarity between data points. During the prediction phase, the algorithm calculates the distance between the new data point and all training examples, selects the K closest neighbors, and makes a prediction based on the majority class or average value of these neighbors. KNN is simple and easy to implement, but it can be computationally expensive and sensitive to irrelevant features.
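
As an illustration of these mechanics, the following plain-Python sketch computes both distance metrics and selects the k closest training rows. This is a conceptual example only, not TD_KNN's SQL interface; the function names and toy data are invented for illustration.

import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def k_nearest(test_point, training_data, k, metric=euclidean):
    # Rank every training row by its distance to the test point
    # and keep the k closest; this is the neighbor-selection step.
    ranked = sorted(training_data, key=lambda row: metric(test_point, row[0]))
    return ranked[:k]

train = [((1.0, 1.0), 'a'), ((1.2, 0.9), 'a'),
         ((5.0, 5.1), 'b'), ((4.8, 5.3), 'b')]
print(k_nearest((1.1, 1.0), train, k=2))  # the two 'a' rows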

The choice of K and the distance metric are important factors that affect the performance of KNN, as the sketch after this list shows.
  • If K is too small, the algorithm may be too sensitive to outliers.
  • If K is too large, the algorithm may not be able to capture the underlying patterns in the data.
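
A toy one-dimensional example makes the effect of K concrete. In this hypothetical sketch (plain Python, not TD_KNN), a single mislabeled outlier flips the prediction when K is 1 but is outvoted when K is 5.

from collections import Counter

def knn_classify(x, train, k):
    # Majority vote among the k nearest 1-D training points
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Class 'a' clusters near 0, class 'b' near 10,
# plus a single 'b' outlier sitting inside the 'a' cluster.
train = [(0.0, 'a'), (0.5, 'a'), (1.0, 'a'), (1.5, 'a'),
         (0.7, 'b'),                       # outlier
         (9.0, 'b'), (9.5, 'b'), (10.0, 'b')]

print(knn_classify(0.8, train, k=1))  # 'b' -- k=1 follows the outlier
print(knn_classify(0.8, train, k=5))  # 'a' -- larger k smooths it out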

TD_KNN supports classification, regression, and neighbors model types. For classification, the category of the test data is determined by a majority vote among its k nearest neighbors. For regression, the predicted value of the test data is the mean of the values of its k nearest neighbors. For neighbors, the function returns the nearest neighbors themselves.
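
The three model types amount to three different prediction rules applied to the same set of nearest neighbors. The following is a minimal sketch of those rules in plain Python, assuming Euclidean distance and invented toy data; TD_KNN's actual argument names and output columns differ and are documented in the syntax and output sections.

import math
from collections import Counter

def k_nearest(x, train, k):
    # train rows are (feature_vector, response); rank by Euclidean distance
    return sorted(train, key=lambda row: math.dist(x, row[0]))[:k]

def predict(x, train, k, model_type):
    neighbors = k_nearest(x, train, k)
    if model_type == 'classification':
        # Majority vote over neighbor labels
        return Counter(r for _, r in neighbors).most_common(1)[0][0]
    if model_type == 'regression':
        # Mean of neighbor response values
        return sum(r for _, r in neighbors) / k
    return neighbors  # 'neighbors': return the rows themselves

train = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((5.0, 5.0), 10.0)]
print(predict((0.5, 0.0), train, 2, 'regression'))  # (1.0 + 2.0) / 2 = 1.5
print(predict((0.5, 0.0), train, 2, 'neighbors'))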

The function supports up to 2018 features and 1000 labels for the classification model type.

The function internally calls the TD_VectorDistance function, whose computational complexity is O(N²), where N is the number of rows. The query may therefore run significantly longer as the number of rows increases in either the training table or the test table. In such cases, alternative algorithms such as decision trees, random forests, or neural networks may be more appropriate.
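
The quadratic cost is easy to see in a brute-force sketch: every test row must be compared with every training row, so the number of distance evaluations is the product of the two row counts. The following plain-Python example is illustrative only and is not how TD_VectorDistance is implemented.

import math

def brute_force_knn_cost(train, test):
    # Every test row is compared with every training row:
    # len(test) * len(train) distance evaluations in total.
    evaluations = 0
    for t in test:
        for tr in train:
            math.dist(t, tr)   # one distance evaluation
            evaluations += 1
    return evaluations

train = [(float(i), 0.0) for i in range(1000)]
test = [(float(i), 1.0) for i in range(1000)]
print(brute_force_knn_cost(train, test))  # 1,000,000 evaluations

Doubling both row counts quadruples the work, which is why row count, not feature count, dominates the runtime of this function.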

Because the training table for TD_KNN is a DIMENSION input, it is copied to the spool of each AMP before processing. For this reason, the size and scalability of this input are limited by the user's available spool space in the database.