This example uses the following user-defined distance metric, defined in the Java class com.example.MyDistance:
package com.example; import com.asterdata.ncluster.sqlmr.data.RowView; import com.asterdata.sqlmr.analytics.classification.knn.distance.Distance; public class MyDistance implements Distance { /** * calculate the distance between the test row and the training row. * note: 1.don't reverse the sequence of parameters * 2. the columns of trainingRowView is 'responseColumn, f1,f2,...,fn' * 3. the columns of testRowView is the same as TEST_TABLE * 4. all the trainingRowView and testRowView is zero-based * (0 <= index && index < getColumnCount()) * * @param testRowView * stands for a point in the test data set * @param trainingRowView * stands for a point in the training data set, the columns is the * columns in distanceFeatures argument * @return the double value of distance */ @Override public double calculate(RowView testRowView, RowView trainingRowView) { return Math.abs(testRowView.getIntAt(1) - trainingRowView.getIntAt(1)); } }
Input
The TrainingTable has as dimensions five attributes of personal computers—price, speed, hard disk size, RAM, and screen size. The table has 5008 rows, categorized into eight price categories.
id | price | speed | hd | ram | screen | computer_category |
---|---|---|---|---|---|---|
1 | 1499 | 25 | 80 | 4 | 14 | SPECIAL |
2 | 1795 | 33 | 85 | 2 | 14 | SUPER |
3 | 1595 | 25 | 170 | 4 | 15 | SPECIAL |
4 | 1849 | 25 | 170 | 8 | 14 | SUPER |
5 | 3295 | 33 | 340 | 16 | 14 | HYPER |
6 | 3695 | 66 | 340 | 16 | 14 | UBER |
7 | 1720 | 25 | 170 | 4 | 14 | SPECIAL |
8 | 1995 | 50 | 85 | 2 | 14 | SUPER |
9 | 2225 | 50 | 210 | 8 | 14 | SUPER |
12 | 2605 | 66 | 210 | 8 | 14 | MEGA |
13 | 2045 | 50 | 130 | 4 | 14 | SUPER |
14 | 2295 | 25 | 245 | 8 | 14 | MEGA |
16 | 2225 | 50 | 130 | 4 | 14 | SUPER |
17 | 1595 | 33 | 85 | 2 | 14 | SPECIAL |
18 | 2325 | 33 | 210 | 4 | 15 | MEGA |
19 | 2095 | 33 | 250 | 4 | 15 | SUPER |
20 | 4395 | 66 | 452 | 8 | 14 | UBER |
... | ... | ... | ... | ... | ... | ... |
The test table has more than 1000 rows.
id | price | speed | hd | ram | screen |
---|---|---|---|---|---|
10 | 2575 | 50 | 210 | 4 | 15 |
11 | 2195 | 33 | 170 | 8 | 15 |
15 | 2699 | 50 | 212 | 8 | 14 |
29 | 3095 | 33 | 340 | 16 | 14 |
30 | 3244 | 66 | 245 | 8 | 14 |
38 | 3795 | 66 | 500 | 8 | 14 |
45 | 3495 | 50 | 340 | 16 | 14 |
46 | 2695 | 33 | 245 | 8 | 14 |
48 | 1749 | 25 | 120 | 4 | 14 |
51 | 2499 | 33 | 170 | 4 | 14 |
52 | 2395 | 33 | 130 | 4 | 14 |
59 | 2945 | 66 | 210 | 8 | 17 |
65 | 2195 | 66 | 85 | 2 | 14 |
66 | 1495 | 25 | 170 | 4 | 14 |
70 | 3095 | 66 | 245 | 8 | 14 |
86 | 1999 | 33 | 120 | 8 | 14 |
91 | 2975 | 50 | 210 | 4 | 17 |
92 | 2145 | 66 | 130 | 4 | 14 |
93 | 2420 | 33 | 170 | 8 | 15 |
94 | 2505 | 50 | 210 | 8 | 14 |
104 | 2999 | 66 | 330 | 4 | 15 |
... | ... | ... | ... | ... | ... |
SQL Call
SELECT * FROM KNN ( ON computers_train1_clustered AS TrainingTable ON computers_test1 AS TestTable OUT TABLE OutputTable (knn_output) USING K (50) ResponseColumn ('computer_category') DistanceFeatures ('price', 'speed', 'hd', 'ram', 'screen') VotingWeight (1) IDColumn ('id') ) AS dt;
Output
message |
---|
Successful! The final result is successfully stored in the table specified in OutputTable argument. |
This query returns the following table:
SELECT * FROM knn_output ORDER BY id;
id | computer_category |
---|---|
10 | MEGA |
11 | SUPER |
15 | MEGA |
29 | HYPER |
30 | HYPER |
38 | UBER |
45 | UBER |
46 | MEGA |
48 | SPECIAL |
51 | MEGA |
52 | MEGA |
59 | HYPER |
65 | SUPER |
66 | SPECIAL |
70 | HYPER |
86 | SUPER |
91 | HYPER |
92 | SUPER |
93 | MEGA |
94 | MEGA |
104 | HYPER |
... | ... |