1.0 - 8.00 - KNN Example - Teradata Vantage

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
1.0
8.00
Release Date
May 2019
Content Type
Programming Reference
Publication ID
B700-4003-098K
Language
English (United States)

This example uses the following user-defined distance metric, defined in the Java class com.example.MyDistance:

package com.example;
import com.asterdata.ncluster.sqlmr.data.RowView;
import com.asterdata.sqlmr.analytics.classification.knn.distance.Distance;
public class MyDistance implements Distance {
    /**
    * calculate the distance between the test row and the training row.
    * note: 1.don't reverse the sequence of parameters
    *       2. the columns of trainingRowView is 'responseColumn, f1,f2,...,fn'
    *       3. the columns of testRowView is the same as TEST_TABLE
    *       4. all the trainingRowView and testRowView is zero-based
    *          (0 <= index && index < getColumnCount())
    *
    * @param testRowView
    *           stands for a point in the test data set
    * @param trainingRowView
    *           stands for a point in the training data set, the columns is the
    * columns in distanceFeatures argument
    * @return the double value of distance
    */
    @Override
    public double calculate(RowView testRowView, RowView trainingRowView) {
        return Math.abs(testRowView.getIntAt(1) - trainingRowView.getIntAt(1));
    }
}

Input

The TrainingTable has as dimensions five attributes of personal computers—price, speed, hard disk size, RAM, and screen size. The table has 5008 rows, categorized into eight price categories.

TrainingTable: computers_train1_clustered
id price speed hd ram screen computer_category
1 1499 25 80 4 14 SPECIAL
2 1795 33 85 2 14 SUPER
3 1595 25 170 4 15 SPECIAL
4 1849 25 170 8 14 SUPER
5 3295 33 340 16 14 HYPER
6 3695 66 340 16 14 UBER
7 1720 25 170 4 14 SPECIAL
8 1995 50 85 2 14 SUPER
9 2225 50 210 8 14 SUPER
12 2605 66 210 8 14 MEGA
13 2045 50 130 4 14 SUPER
14 2295 25 245 8 14 MEGA
16 2225 50 130 4 14 SUPER
17 1595 33 85 2 14 SPECIAL
18 2325 33 210 4 15 MEGA
19 2095 33 250 4 15 SUPER
20 4395 66 452 8 14 UBER
... ... ... ... ... ... ...

The test table has more than 1000 rows.

TestTable: computers_test1
id price speed hd ram screen
10 2575 50 210 4 15
11 2195 33 170 8 15
15 2699 50 212 8 14
29 3095 33 340 16 14
30 3244 66 245 8 14
38 3795 66 500 8 14
45 3495 50 340 16 14
46 2695 33 245 8 14
48 1749 25 120 4 14
51 2499 33 170 4 14
52 2395 33 130 4 14
59 2945 66 210 8 17
65 2195 66 85 2 14
66 1495 25 170 4 14
70 3095 66 245 8 14
86 1999 33 120 8 14
91 2975 50 210 4 17
92 2145 66 130 4 14
93 2420 33 170 8 15
94 2505 50 210 8 14
104 2999 66 330 4 15
... ... ... ... ... ...

SQL Call

SELECT * FROM KNN (
  ON computers_train1_clustered AS TrainingTable
  ON computers_test1 AS TestTable
  OUT TABLE OutputTable (knn_output)
  USING
  K (50)
  ResponseColumn ('computer_category')
  DistanceFeatures ('price', 'speed', 'hd', 'ram', 'screen')
  VotingWeight (1)
  IDColumn ('id')
) AS dt;

Output

message
Successful!
 The final result is successfully stored in the table specified in OutputTable argument.

This query returns the following table:

SELECT * FROM knn_output ORDER BY id;
id computer_category
10 MEGA
11 SUPER
15 MEGA
29 HYPER
30 HYPER
38 UBER
45 UBER
46 MEGA
48 SPECIAL
51 MEGA
52 MEGA
59 HYPER
65 SUPER
66 SPECIAL
70 HYPER
86 SUPER
91 HYPER
92 SUPER
93 MEGA
94 MEGA
104 HYPER
... ...