This example uses KMeans|| sampling. Like Example 2, it treats the numeric variables cyl, gear, and carb as categorical variables, in addition to the categorical variables vs and am. However, this example uses the Manhattan distance metric for the numeric variables and the Hamming distance metric for the categorical variables. Because the Hamming distance metric requires categories of equal length, assume that in column am of the input table, 'manual' has been changed to 'manualsys' (which has the same length as 'automatic', nine characters).
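The two metrics named above can be sketched in a few lines of Python. This is only an illustration of the standard definitions of Hamming and Manhattan distance, not the internal implementation of RandomSample; the function names `hamming` and `manhattan` are hypothetical, and the sketch assumes that Hamming distance is computed character by character over the category strings, which is why the strings must be of equal length.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Hamming distance is undefined for strings of different lengths.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of absolute coordinate differences between two numeric vectors."""
    if len(x) != len(y):
        raise ValueError("Manhattan distance requires vectors of equal length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


if __name__ == "__main__":
    # 'automatic' and 'manualsys' are both nine characters long,
    # and they differ at every position.
    print(hamming("automatic", "manualsys"))  # 9

    # The original value 'manual' (six characters) cannot be compared
    # with 'automatic' under this definition.
    try:
        hamming("automatic", "manual")
    except ValueError as err:
        print(err)
```

Under this definition, comparing 'automatic' with the unmodified value 'manual' fails, which is the reason the input table is updated before the function is called.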
Input
- InputTable: fs_input1, created from fs_input (in RandomSample Example 1: Basic Sampling (Weighted)) and populated with these statements:
CREATE MULTISET TABLE fs_input1 AS (
  SELECT * FROM fs_input
) WITH DATA;
UPDATE fs_input1 SET am = 'manualsys' WHERE am = 'manual';
SQL Call
SELECT * FROM RandomSample (
  ON fs_input1 AS InputTable
  USING
  NumSample (20)
  SamplingMode ('kmeans||')
  InputColumns ('mpg:carb')
  CategoryWeights (1000, 10, 100, 100, 100)
  AsCategories ('cyl', 'gear', 'carb')
  CategoricalDistance ('hamming')
  Distance ('manhattan')
  Seed (1)
  IterationNum (2)
  SeedColumn ('model')
) AS dt ORDER BY 1, 2, 3;
Output
set_id | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12.42 | 8 | 414.4 | 228 | 3.324 | 4.7398 | 16.808 | S | automatic | 3 | 4 |
0 | 15.8 | 8 | 351 | 264 | 4.22 | 3.17 | 14.5 | S | manualsys | 5 | 4 |
0 | 17.225 | 8 | 349 | 162.5 | 2.9375 | 3.58125 | 16.9525 | S | automatic | 3 | 2 |
0 | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.73 | 17.6 | S | automatic | 3 | 3 |
0 | 19.2 | 6 | 67.6 | 123 | 3.92 | 3.44 | 18.3 | V | automatic | 4 | 4 |
0 | 19.7 | 6 | 145 | 175 | 3.62 | 2.77 | 15.5 | S | manualsys | 5 | 6 |
0 | 21.4 | 4 | 121 | 109 | 4.11 | 2.78 | 18.6 | V | manualsys | 4 | 2 |
0 | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | V | automatic | 3 | 1 |
0 | 21.5 | 4 | 120.1 | 97 | 3.7 | 2.465 | 20.01 | V | automatic | 3 | 1 |
0 | 23.6 | 4 | 143.75 | 78.5 | 3.805 | 3.17 | 21.45 | V | automatic | 4 | 2 |