1.1 - 8.10 - RandomSample Syntax Elements - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.1, 8.10
Published: October 2019
Content Type: Programming Reference
Publication ID: B700-4003-079K
Language: English (United States)
NumSample
Specify the size of each sample, an INTEGER value. If you specify n sample_size values, the function selects n samples. For each sample, the output table has sample_size rows. The output column set_id identifies the sample to which each row belongs.
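As a rough illustration of how several sample_size values map to output rows, the following Python sketch draws one sample per requested size and tags each row with a set_id. It is not the function's implementation: the helper name draw_samples, the 0-based set_id numbering, and sampling without replacement are all assumptions made for the example.

```python
import random

def draw_samples(rows, sample_sizes, seed=0):
    """Draw one sample per entry in sample_sizes and tag rows with a set_id.

    Assumptions (not stated in the reference): set_id starts at 0 and
    rows are drawn without replacement within each sample.
    """
    rng = random.Random(seed)
    output = []
    for set_id, size in enumerate(sample_sizes):
        for row in rng.sample(rows, size):
            output.append((set_id, row))
    return output

# Two sample_size values (2 and 3) give two samples and 2 + 3 = 5 output rows.
input_rows = list(range(10))
output_rows = draw_samples(input_rows, [2, 3])
```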
WeightColumn
[Optional] Specify the name of the InputTable column that contains weights for weighted sampling. The weight_column must have a numeric SQL data type.
Default behavior: Rows have equal weight.
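The effect of a weight column can be sketched in Python as follows. This is a minimal illustration, not the function's implementation: the helper name weighted_sample is hypothetical, and the sketch samples with replacement, which the reference does not specify.

```python
import random

def weighted_sample(rows, weights=None, k=1, seed=None):
    """Pick k rows with probability proportional to each row's weight.

    weights=None mirrors the default behavior (all rows weighted equally).
    Assumption: draws are made with replacement.
    """
    rng = random.Random(seed)
    return rng.choices(rows, weights=weights, k=k)

# A row with weight 0 is never selected; a row with twice the weight of
# another is selected about twice as often over many draws.
picked = weighted_sample(["a", "b", "c"], weights=[0, 1, 0], k=5, seed=1)
```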
SamplingMode
[Optional] Specify the sampling mode:
Option Description
'basic' (Default) Each InputTable row has a probability of being selected that is proportional to its weight. Weight of each row is in weight_column.
'kmeans++' One row is selected in each of k iterations, where k is number of desired output rows. First row is selected randomly. In subsequent iterations, probability of row being selected is proportional to value in WeightColumn multiplied by distance from nearest row in set of selected rows. Distance is calculated using methods specified by Distance and CategoricalDistance syntax elements.
'kmeans||' Enhanced version of KMeans++ that exploits parallel architecture to accelerate sampling process. Algorithm is described in paper Scalable K-Means++ by Bahmani et al. (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). At each iteration, probability that a row is selected is proportional to value in WeightColumn multiplied by distance from nearest row in set of selected rows (as in KMeans++). However, KMeans|| algorithm oversamples at each iteration, significantly reducing required number of iterations; therefore, intermediate set of rows might have more than k data points. Each row in intermediate set is then weighted by number of rows in table that are closer to that row than to any other selected row, and rows are clustered to produce exactly k rows.
Tip: For optimal performance, use 'kmeans++' when the desired sample size is less than 15 and 'kmeans||' otherwise.
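The 'kmeans++' mode described above can be sketched in Python as follows. This is an illustrative approximation only: the helper names are hypothetical, Euclidean distance on numeric tuples is assumed, and the first row is assumed to be chosen uniformly at random.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two numeric tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeanspp_sample(points, k, weights=None, seed=None):
    """Select k rows, one per iteration, in the style of KMeans++ seeding."""
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(points)  # default: equal weights
    # First row: chosen at random (uniformly, by assumption).
    chosen = [rng.randrange(len(points))]
    # nearest[i] is the distance from row i to its closest selected row.
    nearest = [euclidean(p, points[chosen[0]]) for p in points]
    while len(chosen) < k:
        # Selection probability is proportional to weight * nearest distance.
        scores = [w * d for w, d in zip(weights, nearest)]
        if sum(scores) == 0:
            break  # every remaining row coincides with a selected row
        idx = rng.choices(range(len(points)), weights=scores, k=1)[0]
        chosen.append(idx)
        nearest = [min(d, euclidean(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return [points[i] for i in chosen]

sample = kmeanspp_sample([(0, 0), (0, 1), (10, 10), (10, 11)], k=3, seed=7)
```

Because an already selected row has distance 0 to itself, its selection score is 0, so the sketch tends to spread the selected rows across the data.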
Distance
[Optional] For KMeans++ and KMeans|| sampling, specify the function for computing the distance between numerical variables.
TargetColumns
[Required for KMeans++ and KMeans|| sampling, optional otherwise] Specify the names of the InputTable columns to use to calculate the distance between numerical variables.
NumericAsCategorical
[Optional] For KMeans++ and KMeans|| sampling, specify the names of the InputTable columns that contain numerical variables to treat as categorical variables.
Default behavior: No numerical variables are treated as categorical variables.
CategoryWeights
[Optional] For KMeans++ and KMeans|| sampling, specify the weights (DOUBLE PRECISION values) of the categorical variables, including those that the NumericAsCategorical syntax element specifies. Specify the weights in the order (from left to right) that the variables appear in the input table. When calculating the distance between two rows, distances between categorical values are scaled by these weights.
Default behavior: All categories have equal weight.
CategoricalDistance
[Optional] For KMeans++ and KMeans|| sampling, specify the function for computing the distance between categorical variables:
Option Description
'overlap' (Default) Distance between two variables is 0 if they are the same, 1 if they are different.
'hamming' Distance between two variables is Hamming distance between strings that represent them. Strings must have equal length.
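The two categorical distance options can be written out in a few lines of Python. The function names below are hypothetical and chosen only for this illustration; the behavior follows the descriptions above.

```python
def overlap_distance(a, b):
    """'overlap': 0 if the two values are the same, 1 if they differ."""
    return 0 if a == b else 1

def hamming_distance(a, b):
    """'hamming': number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# overlap_distance("red", "blue") is 1, however different the strings are;
# hamming_distance("karolin", "kathrin") is 3 (positions 3, 4, and 5 differ).
```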
Seed
[Optional] Specify the random seed the algorithm uses for repeatable results. The seed must be a LONG value.
If you specify Seed, you must also specify SeedColumn.
For repeatable results, use both the Seed and UniqueID syntax elements. For more information, see Nondeterministic Results and UniqueID Syntax Element.
SeedColumn
[Optional] Specify the names of the InputTable columns by which to partition the input. Function calls that use the same input data, seed, and seed_column output the same result.
If you specify SeedColumn, you must also specify Seed.
Ideally, the number of distinct values in the seed_column is the same as the number of workers in the cluster. A large number of distinct values in the seed_column degrades function performance.
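The idea behind pairing a seed with a partitioning column can be sketched in Python as follows. This is not how the engine is implemented: the helper name partitioned_sample, the per_partition parameter, the per-partition seeding scheme, and the sorting of rows within a partition are all assumptions made to show why fixing both the seed and the partitioning gives repeatable output.

```python
import random
from collections import defaultdict

def partitioned_sample(rows, key_fn, per_partition, seed):
    """Sample within each partition using a generator derived from (seed, key)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key_fn(row)].append(row)
    output = []
    for key in sorted(partitions):
        # Seeding with a string is deterministic across runs in Python.
        rng = random.Random(f"{seed}:{key}")
        members = sorted(partitions[key])  # fixed order within the partition
        output.extend(rng.sample(members, min(per_partition, len(members))))
    return output

rows = [(i % 3, i) for i in range(30)]  # first field plays the seed_column role
first = partitioned_sample(rows, lambda r: r[0], per_partition=2, seed=42)
again = partitioned_sample(rows[::-1], lambda r: r[0], per_partition=2, seed=42)
# first == again: same data, seed, and partitioning give the same result,
# even though the rows arrived in a different order.
```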
OverSamplingRate
[Optional] For KMeans|| sampling, specify the oversampling rate (a positive DOUBLE PRECISION value). The function multiplies this rate by each sample_size value.
Default: 1.0
IterNum
[Optional] For KMeans|| sampling, specify the number of iterations (a positive INTEGER value).
Default: 5
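How the oversampling rate and iteration count interact in 'kmeans||' mode can be sketched in Python as follows. This is a simplified, single-process approximation, not the engine's implementation. It assumes that rate multiplied by sample_size is the expected number of rows added per iteration (as in the Bahmani et al. paper), uses Euclidean distance, ignores WeightColumn, and omits the final clustering step that reduces the candidates to exactly k rows. All function names are hypothetical.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two numeric tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_parallel_candidates(points, sample_size, rate=1.0, iter_num=5, seed=None):
    """Oversample candidate rows for iter_num iterations."""
    rng = random.Random(seed)
    ell = rate * sample_size  # assumed: expected rows added per iteration
    selected = [rng.choice(points)]
    for _ in range(iter_num):
        nearest = [min(dist(p, s) for s in selected) for p in points]
        total = sum(nearest)
        if total == 0:
            break  # every row coincides with a selected row
        for p, d in zip(points, nearest):
            # Each row is picked independently, so several rows can be
            # added in one iteration (the oversampling).
            if rng.random() < min(1.0, ell * d / total):
                selected.append(p)
    return selected

def candidate_weights(points, candidates):
    """Weight each candidate by the number of rows closest to it."""
    counts = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: dist(p, candidates[i]))
        counts[j] += 1
    return counts

points = [(float(i), float(i % 7)) for i in range(50)]
candidates = kmeans_parallel_candidates(points, sample_size=5, seed=3)
weights = candidate_weights(points, candidates)
```

With the defaults (rate 1.0, five iterations) and a sample_size of 5, this sketch yields roughly 25 candidates on average, which would then be clustered down to 5 rows using the candidate weights.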
SetIDAsFirstColumn
[Optional] Specify whether the function-generated set_id column is the first output table column. If you specify 'false', set_id is the last output table column.
Default: 'true'