TD_KNN Usage Notes | kNN | Teradata Vantage - Analytics Database

Database Analytic Functions
Analytics Database
Release Number: June 2022
English (United States)
Product Category: Teradata Vantage™

K-Nearest Neighbors (KNN) is a supervised learning algorithm that is commonly used for classification and regression problems. It's a non-parametric, instance-based, and lazy learning algorithm that operates under the principle of similarity-based classification, where a sample is classified based on the class labels of its nearest neighbors in the feature space.

Given a set of training samples X = {x_1, x_2, ..., x_n}, where each x_i is a d-dimensional feature vector and its corresponding class label y_i, the KNN algorithm works as follows:

For a new sample x', the algorithm computes the distance between x' and each training sample x_i in the feature space. A common choice of distance metric is the Euclidean distance:

d(x', x_i) = √( (x'_1 − x_i1)² + (x'_2 − x_i2)² + ... + (x'_d − x_id)² )

where x_ij is the j-th feature of x_i and x'_j is the j-th feature of x'.
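The distance computation can be sketched in Python (a generic illustration of the formula, not the TD_KNN SQL interface; the sample vectors are made up for the example):

```python
import math

def euclidean_distance(x_new, x_i):
    """Distance between a new sample x' and a training sample x_i,
    both given as equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_new, x_i)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # → 5.0
```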

The "k" nearest neighbors to x' are then selected, where "k" is a user-defined positive integer.

The class label of x' is determined by the majority vote of the class labels of its k nearest neighbors.

Mathematically, the predicted class label ŷ of x' can be expressed as:

ŷ = argmax_{c_j} Σ_{i=1}^{k} I(y_i = c_j)

where c_j ranges over the class labels, y_i is the class label of the i-th nearest neighbor, I(·) is the indicator function (1 when its argument is true, 0 otherwise), and the sum is over the k nearest neighbors.
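The steps above can be combined into a short sketch (illustrative Python only, not the TD_KNN SQL interface; the 2-D training set and the value of k are hypothetical):

```python
import math
from collections import Counter

def knn_classify(x_new, training_data, k):
    """Classify x_new by majority vote among its k nearest
    training samples. training_data is a list of (features, label) pairs."""
    # Step 1: compute the Euclidean distance from x_new to every training sample.
    distances = [
        (math.dist(x_new, features), label)
        for features, label in training_data
    ]
    # Step 2: select the k nearest neighbors.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training set with two classes.
train = [([1.0, 1.0], "A"), ([1.5, 2.0], "A"),
         ([5.0, 5.0], "B"), ([6.0, 5.5], "B")]
print(knn_classify([1.2, 1.1], train, k=3))  # → A
```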

KNN is a simple, fast, and versatile algorithm that has been applied to a wide range of problems. However, its performance can be sensitive to:
  • The choice of distance metric
  • The number of nearest neighbors (k)
  • The presence of irrelevant or noisy features
  • The size of the dataset (prediction cost grows with the number of training samples)
KNN has the following benefits:
  • It is easy to implement
  • It has no training phase
  • It can be used for classification, regression, and anomaly detection problems
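For regression, the majority vote is replaced by an average of the neighbors' target values. A minimal sketch under the same assumptions as before (illustrative Python, hypothetical 1-D training data, not the TD_KNN SQL interface):

```python
import math

def knn_regress(x_new, training_data, k):
    """Predict a numeric target for x_new as the mean target value
    of its k nearest training samples (Euclidean distance)."""
    distances = [
        (math.dist(x_new, features), target)
        for features, target in training_data
    ]
    # Average the targets of the k nearest neighbors.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in neighbors) / k

# Hypothetical training set: target is roughly 2 * feature.
train = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0), ([10.0], 20.0)]
print(knn_regress([2.5], train, k=2))  # → 5.0
```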

In conclusion, KNN is a simple and powerful algorithm that is widely used across machine learning tasks. Its strengths include the ability to handle non-linear relationships and multi-class problems, and its ease of implementation. However, its performance is sensitive to the choice of "k", the distance metric, and the presence of irrelevant or noisy features, so it is important to evaluate it carefully for each specific problem.