RandomSample Example 2: KMeans++ Sampling

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.0, 8.00
Release Date: May 2019
Content Type: Programming Reference
Publication ID: B700-4003-098K
Language: English (United States)

This example uses KMeans++ sampling with the Manhattan distance metric. It treats the numeric variables cyl, gear, and carb as categorical variables, in addition to vs and am, which are already categorical. The category weights are assigned in the order in which the categorical columns appear in the input table: 1000 to cyl, 10 to vs, 100 to am, 100 to gear, and 100 to carb.
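
The pairing of weights to columns can be illustrated with a short Python sketch. It is not part of RandomSample; the variable names are hypothetical, and the column order is taken from the output header of this example. The sketch walks the input columns in table order, keeps the categorical ones, and pairs them with the CategoryWeights values.

```python
# Input columns in table order (from the output header, mpg through carb).
input_columns = ["mpg", "cyl", "disp", "hp", "drat", "wt", "qsec",
                 "vs", "am", "gear", "carb"]

# Numeric columns named in AsCategories, plus the columns that are
# already categorical in the input table.
as_categories = {"cyl", "gear", "carb"}
already_categorical = {"vs", "am"}

# Values passed to CategoryWeights, in the order given in the call.
category_weights = [1000, 10, 100, 100, 100]

# Keep categorical columns in table order, then pair each with its weight.
categorical_in_order = [c for c in input_columns
                        if c in as_categories | already_categorical]
weight_by_column = dict(zip(categorical_in_order, category_weights))

print(weight_by_column)
# {'cyl': 1000, 'vs': 10, 'am': 100, 'gear': 100, 'carb': 100}
```

Note that the weight order follows the table, not the order of the names in AsCategories.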

Input

SQL Call

SELECT * FROM RandomSample (
  ON fs_input AS InputTable
  USING
  NumSample (10)
  SamplingMode ('kmeans++')
  InputColumns ('mpg:carb')
  CategoryWeights (1000, 10, 100, 100, 100)
  AsCategories ('cyl', 'gear', 'carb')
  Distance ('manhattan')
  Seed (1)
  SeedColumn ('model')
) AS dt ORDER BY 1, 2, 3;
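
This page does not describe how RandomSample implements KMeans++ sampling internally. As a rough illustration only, the following Python sketch shows textbook k-means++ seeding with a weighted Manhattan distance. It rests on several assumptions that the source does not confirm: a categorical mismatch adds that column's weight to the distance, numeric columns add their absolute difference, and each later pick is drawn with probability proportional to the squared distance to the nearest row already chosen. The function names are hypothetical, and the Python seed is unrelated to the Seed and SeedColumn arguments. The rows are a subset of the columns of five rows from this example's output.

```python
import random

def weighted_manhattan(a, b, numeric_cols, weight_by_column):
    """Assumed mixed distance: |a - b| per numeric column, plus the
    column weight for every categorical column whose values differ."""
    numeric_part = sum(abs(a[c] - b[c]) for c in numeric_cols)
    categorical_part = sum(w for c, w in weight_by_column.items()
                           if a[c] != b[c])
    return numeric_part + categorical_part

def kmeanspp_sample(rows, k, numeric_cols, weight_by_column, seed=1):
    """Pick up to k distinct rows with textbook k-means++ seeding."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(rows))]  # first pick is uniform
    while len(chosen) < min(k, len(rows)):
        scores = []
        for i, row in enumerate(rows):
            if i in chosen:
                scores.append(0.0)  # already picked, never redrawn
                continue
            nearest = min(
                weighted_manhattan(row, rows[j], numeric_cols, weight_by_column)
                for j in chosen
            )
            scores.append(nearest ** 2)
        if sum(scores) == 0:
            break  # every remaining row duplicates a chosen row
        chosen.append(rng.choices(range(len(rows)), weights=scores, k=1)[0])
    return [rows[i] for i in chosen]

rows = [
    {"model": "Mazda RX4 Wag", "mpg": 21.0, "cyl": 6, "am": "manual"},
    {"model": "Hornet 4 Drive", "mpg": 21.4, "cyl": 6, "am": "automatic"},
    {"model": "Merc 450SL", "mpg": 17.3, "cyl": 8, "am": "automatic"},
    {"model": "Fiat 128", "mpg": 32.4, "cyl": 4, "am": "manual"},
    {"model": "Maserati Bora", "mpg": 15.0, "cyl": 8, "am": "manual"},
]
weights = {"cyl": 1000, "am": 100}  # subset of the example's weights

sample = kmeanspp_sample(rows, 3, ["mpg"], weights, seed=1)
for row in sample:
    print(row["model"])
```

Under these assumptions, a large weight such as 1000 on cyl makes rows with different cylinder counts far apart, so the sample tends to spread across cyl values before it spreads across the lower-weighted columns.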

Output

set_id  sn  model             mpg   cyl  disp   hp   drat  wt     qsec   vs  am         gear  carb
0       2   Mazda RX4 Wag     21    6    160    110  3.9   2.875  17.02  S   manual     4     4
0       4   Hornet 4 Drive    21.4  6    258    110  3.08  3.215  19.44  V   automatic  3     1
0       13  Merc 450SL        17.3  8    275.8  180  3.07  3.73   17.6   S   automatic  3     3
0       18  Fiat 128          32.4  4    78.7   66   4.08  2.2    19.47  V   manual     4     1
0       21  Toyota Corona     21.5  4    120.1  97   3.7   2.465  20.01  V   automatic  3     1
0       24  Camaro Z28        13.3  8    350    245  3.73  3.84   15.41  S   automatic  3     4
0       25  Pontiac Firebird  19.2  8    400    175  3.08  3.845  17.05  S   automatic  3     2
0       27  Porsche 914-2     26    4    120.3  91   4.43  2.14   16.7   S   manual     5     2
0       30  Ferrari Dino      19.7  6    301    335  3.54  3.57   14.6   S   manual     5     6
0       31  Maserati Bora     15    8    301    335  3.54  3.57   14.6   S   manual     5     8