RandomSample Arguments - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.00

1.0

Published

May 2019

Language

English (United States)

Last Update

2019-11-22

Product Category

Teradata Vantage™

NumSample

Specify both the number of samples and their sizes. For each sample_size (an INTEGER value), the function selects a sample that has sample_size rows.

WeightColumn

[Optional] Specify the name of the input_table column that contains weights for weighted sampling. The weight_column must have a numeric SQL data type.

Default behavior: Rows have equal weight.
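As a rough illustration of how NumSample and WeightColumn interact in 'basic' mode, the following Python sketch draws one sample per requested size, with each draw proportional to the row weight. It is not the function's implementation: the helper name weighted_samples is hypothetical, and sampling without replacement is an assumption that this reference does not confirm.

```python
import random

def weighted_samples(rows, weights, sample_sizes, seed=None):
    """Draw one weighted sample per requested size.

    Assumption: rows are drawn without replacement within a sample.
    """
    rng = random.Random(seed)
    samples = []
    for size in sample_sizes:
        remaining = list(range(len(rows)))
        chosen = []
        for _ in range(size):
            # Each remaining row is picked with probability proportional to its weight.
            pick = rng.choices(
                remaining, weights=[weights[i] for i in remaining], k=1
            )[0]
            chosen.append(rows[pick])
            remaining.remove(pick)
        samples.append(chosen)
    return samples

rows = ["a", "b", "c", "d"]
weights = [1.0, 2.0, 0.0, 5.0]  # row "c" has zero weight and is never drawn
samples = weighted_samples(rows, weights, sample_sizes=[2, 3], seed=42)
```

Here sample_sizes=[2, 3] plays the role of two sample_size values in NumSample: the result holds two samples, one with 2 rows and one with 3 rows.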

SamplingMode

[Optional] Specify the sampling mode:

Option	Description
'basic' (Default)	Each input_table row has a probability of being selected that is proportional to its weight. The weight of each row is in weight_column.
'kmeans++'	One row is selected in each of k iterations, where k is the number of desired output rows. The first row is selected randomly. In each subsequent iteration, the probability of a row being selected is proportional to its value in WeightColumn multiplied by its distance from the nearest row in the set of selected rows. Distance is calculated using the methods that the Distance and CategoricalDistance arguments specify.
'kmeans||'	Enhanced version of KMeans++ that exploits the parallel architecture to accelerate the sampling process. The algorithm is described in the paper Scalable K-Means++ by Bahmani et al. (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). At each iteration, the probability that a row is selected is proportional to its value in WeightColumn multiplied by its distance from the nearest row in the set of selected rows (as in KMeans++). However, the KMeans|| algorithm oversamples at each iteration, which significantly reduces the required number of iterations; therefore, the resulting set of rows might have more than k data points. Each row in the resulting set is then weighted by the number of rows in the table that are closer to that row than to any other selected row, and the rows are clustered to produce exactly k rows.

Tip: For optimal performance, use 'kmeans++' when the desired sample size is less than 15 and 'kmeans||' otherwise.
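The 'kmeans++' selection rule described above can be sketched in Python as follows. This is an illustrative sketch only: the name kmeanspp_sample is hypothetical, Euclidean distance stands in for whatever the Distance argument selects, and a uniform choice of the first row is an assumption.

```python
import math
import random

def kmeanspp_sample(points, weights, k, seed=None):
    """Return indices of k rows chosen by a KMeans++-style rule."""
    rng = random.Random(seed)
    # Assumption: the first row is chosen uniformly at random.
    selected = [rng.randrange(len(points))]
    while len(selected) < k:
        # Score = row weight * distance to the nearest already-selected row.
        scores = [
            w * min(math.dist(p, points[j]) for j in selected)
            for p, w in zip(points, weights)
        ]
        if sum(scores) == 0:
            break  # no remaining row has a nonzero selection probability
        selected.append(rng.choices(range(len(points)), weights=scores, k=1)[0])
    return selected

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
weights = [1.0] * len(points)
picked = kmeanspp_sample(points, weights, k=3, seed=1)
```

Because an already-selected row has distance 0 to itself, its score is 0 and it cannot be selected again, so the sketch returns distinct rows.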

Distance

[Optional] For KMeans++ and KMeans|| sampling, specify the function for computing the distance between numerical variables.

InputColumns

[Optional for 'basic' sampling; required for KMeans++ and KMeans|| sampling.]

Specify the names of the input_table columns to use to calculate the distance between numerical variables.

AsCategories

[Optional] For KMeans++ and KMeans|| sampling, specify the names of the input_table columns that contain numerical variables to treat as categorical variables.

Default behavior: No numerical variables are treated as categorical variables.

CategoryWeights

[Optional] For KMeans++ and KMeans|| sampling, specify the weights (DOUBLE PRECISION values) of the categorical variables, including those that the AsCategories argument specifies. Specify the weights in the order (from left to right) that the variables appear in the input table. When calculating the distance between two rows, distances between categorical values are scaled by these weights.

Default behavior: All categories have equal weight.

CategoricalDistance

[Optional] For KMeans++ and KMeans|| sampling, specify the function for computing the distance between categorical variables:

Option	Description
'overlap' (Default)	Distance between two variables is 0 if they are the same, 1 if they are different.
'hamming'	Distance between two variables is Hamming distance between strings that represent them. Strings must have equal length.
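The two categorical distance options, and the scaling that CategoryWeights applies, can be illustrated with a short Python sketch. The function names are hypothetical, and summing the weighted per-column distances is an assumption; this reference states only that categorical distances are scaled by the weights.

```python
def overlap_distance(a, b):
    """'overlap': 0 if the two values are the same, 1 if they differ."""
    return 0 if a == b else 1

def hamming_distance(a, b):
    """'hamming': number of differing positions in the string forms."""
    sa, sb = str(a), str(b)
    if len(sa) != len(sb):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(sa, sb))

def categorical_distance(row_a, row_b, category_weights, metric=overlap_distance):
    """Scale each per-column categorical distance by its weight.

    Assumption: the scaled per-column distances are summed.
    """
    return sum(
        w * metric(x, y) for w, x, y in zip(category_weights, row_a, row_b)
    )

d_overlap = overlap_distance("red", "blue")
d_hamming = hamming_distance("karolin", "kathrin")
d_rows = categorical_distance(("red", "S"), ("blue", "S"), [2.0, 1.0])
```

In the last call, only the first column differs, so the row distance equals that column's weight.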

Seed

[Optional] Specify the random seed the algorithm uses for repeatable results (for more information, see Nondeterministic Results). The seed must be a LONG value.

If you specify Seed, you must also specify SeedColumn.

SeedColumn

[Optional] Specify the names of the input_table columns by which to partition the input. Function calls that use the same input data, seed, and seed_column output the same result.

If you specify SeedColumn, you must also specify Seed.

Ideally, the number of distinct values in the seed_column is the same as the number of workers in the cluster. A large number of distinct values in the seed_column degrades function performance.
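One way to picture why Seed and SeedColumn together give repeatable results is the Python sketch below: rows are partitioned by the seed column, and each partition uses its own random generator derived from the seed and the partition value. This is a conceptual sketch, not the function's implementation; the name seeded_sample, the per-partition sample size, and the way the per-partition generator is seeded are all assumptions.

```python
import random
from collections import defaultdict

def seeded_sample(rows, seed, seed_column, size_per_partition):
    """Sample from each partition with a generator derived from seed and key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[seed_column]].append(row)
    result = []
    for key in sorted(partitions):
        # Same seed and same partition value always give the same generator state.
        rng = random.Random(f"{seed}:{key}")
        # Sort rows so the outcome does not depend on input order.
        part = sorted(partitions[key], key=lambda r: sorted(r.items()))
        result.extend(rng.sample(part, min(size_per_partition, len(part))))
    return result

rows = [
    {"id": 1, "region": "east"}, {"id": 2, "region": "east"},
    {"id": 3, "region": "east"}, {"id": 4, "region": "west"},
    {"id": 5, "region": "west"}, {"id": 6, "region": "west"},
]
first = seeded_sample(rows, seed=7, seed_column="region", size_per_partition=2)
second = seeded_sample(list(reversed(rows)), seed=7, seed_column="region",
                       size_per_partition=2)
```

Both calls use the same data, seed, and seed column, so they return the same rows even though the input order differs.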

OverSamplingRate

[Optional] For KMeans|| sampling, specify the oversampling rate (a positive DOUBLE PRECISION value). The function multiplies this rate by each sample_size.

Default: 1.0

IterationNum

[Optional] For KMeans|| sampling, specify the number of iterations (a positive INTEGER value).

Default: 5
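To show how OverSamplingRate and IterationNum could shape KMeans|| sampling, the following Python sketch runs only the oversampling phase and the candidate-weighting step. It is loosely modeled on the description in this reference and on Bahmani et al., not on the function's actual code: the name kmeans_parallel_candidates is hypothetical, Euclidean distance is assumed, the use of oversampling_rate * k as the expected number of rows added per iteration is an assumption, and the final clustering down to exactly k rows is omitted.

```python
import math
import random

def kmeans_parallel_candidates(points, weights, k, oversampling_rate=1.0,
                               iterations=5, seed=None):
    """Return {candidate_index: count of rows nearest to that candidate}."""
    rng = random.Random(seed)
    # Assumption: rate * k is the expected number of rows added per iteration.
    ell = oversampling_rate * k
    selected = [rng.randrange(len(points))]
    for _ in range(iterations):
        scores = [
            w * min(math.dist(p, points[j]) for j in selected)
            for p, w in zip(points, weights)
        ]
        total = sum(scores)
        if total == 0:
            break
        # Each row is selected independently, so one pass can add several rows.
        selected += [
            i for i, s in enumerate(scores)
            if rng.random() < min(1.0, ell * s / total)
        ]
    # Weight each candidate by the number of rows closest to it.
    counts = {j: 0 for j in selected}
    for p in points:
        nearest = min(selected, key=lambda j: math.dist(p, points[j]))
        counts[nearest] += 1
    return counts

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 1)]
weights = [1.0] * len(points)
candidates = kmeans_parallel_candidates(points, weights, k=2, seed=3)
```

The returned dictionary can hold more than k candidates, which matches the note that the oversampled set might have more than k data points before the final clustering step.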