Arguments - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product: Aster Analytics
Release Number: 6.21
Published: November 2016
Language: English (United States)
Last Update: 2018-04-14
Argument Category Description
InputTable Required Specifies the name of the table that contains the data set from which to take samples.
NumSample Required Specifies both the number of samples and their sizes. For each sample_size (an INTEGER value), the function selects a sample that has sample_size rows.
WeightColumn Optional Specifies the name of the input_table column that contains weights for weighted sampling. The weight_column must have a numeric SQL data type. By default, rows have equal weight.
SamplingMode Optional Specifies the sampling mode:
  • 'basic' (default)

    Each input_table row has a probability of being selected that is proportional to its weight. The weight of each row is in weight_column.

  • 'kmeans++'

    One row is selected in each of k iterations, where k is the number of desired output rows. The first row is selected randomly. In subsequent iterations, the probability of a row being selected is proportional to the value in the WeightColumn multiplied by the distance from the nearest row in the set of selected rows. The distance is calculated using the methods specified by the Distance and CategoricalDistance arguments.

  • 'kmeans||'

    An enhanced version of KMeans++ that exploits parallel architecture to accelerate the sampling process. The algorithm is described in the paper "Scalable K-Means++" by Bahmani et al. (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). Briefly, at each iteration, the probability that a row is selected is proportional to the value in the WeightColumn multiplied by the distance from the nearest row in the set of selected rows (as in KMeans++). However, the KMeans|| algorithm oversamples at each iteration, which significantly reduces the required number of iterations; as a result, the intermediate set of rows might have more than k data points. Each row in that set is then weighted by the number of rows in the table that are closer to that row than to any other selected row, and the rows are clustered to produce exactly k rows.

Tip: For optimal performance, use 'kmeans++' when the desired sample size is less than 15 and 'kmeans||' otherwise.
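
The 'kmeans++' selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the Aster implementation: the function name kmeanspp_sample, the in-memory list of numeric tuples, and the use of Euclidean distance are assumptions made for the example. It follows the description on this page, in which the selection probability is proportional to the weight multiplied by the distance to the nearest selected row.

```python
import math
import random


def kmeanspp_sample(rows, weights, k, seed=None):
    """Return the indices of k rows chosen by a kmeans++-style rule.

    rows    -- list of numeric tuples (hypothetical in-memory table)
    weights -- one non-negative weight per row (the WeightColumn values)
    k       -- number of rows to select
    """
    n = len(rows)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of rows")
    rng = random.Random(seed)

    # The first row is selected uniformly at random.
    selected = [rng.randrange(n)]
    # nearest[i] is the distance from row i to its nearest selected row.
    nearest = [math.dist(rows[i], rows[selected[0]]) for i in range(n)]

    while len(selected) < k:
        # Selection score: weight multiplied by distance to the nearest pick.
        scores = [weights[i] * nearest[i] for i in range(n)]
        if sum(scores) == 0:
            # Assumption: if every remaining row has a zero score, fall back
            # to a uniform choice among the rows not yet selected.
            remaining = [i for i in range(n) if i not in selected]
            pick = rng.choice(remaining)
        else:
            pick = rng.choices(range(n), weights=scores, k=1)[0]
        selected.append(pick)
        # Update the nearest-selected-row distances after the new pick.
        for i in range(n):
            nearest[i] = min(nearest[i], math.dist(rows[i], rows[pick]))
    return selected
```

Already selected rows have distance 0 to themselves, so their score is 0 and they are not picked again. One row is added per pass, so the loop runs k - 1 times after the initial pick.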
Distance Optional For KMeans++ and KMeans|| sampling, specifies the function for computing the distance between numerical variables:
InputColumns Optional For KMeans++ and KMeans|| sampling, specifies the names of the input_table columns to use to calculate the distance between numerical variables.
AsCategories Optional For KMeans++ and KMeans|| sampling, specifies the names of the input_table columns that contain numerical variables to treat as categorical variables.
CategoryWeights Optional For KMeans++ and KMeans|| sampling, specifies the weights (DOUBLE PRECISION values) of the categorical variables, including those that the AsCategories argument specifies. Specify the weights in the order (from left to right) that the variables appear in the input table. When calculating the distance between two rows, distances between categorical values are scaled by these weights.
CategoricalDistance Optional For KMeans++ and KMeans|| sampling, specifies the function for computing the distance between categorical variables:

  • 'overlap' (default)

    The distance between two variables is 0 if they are the same and 1 if they are different.

  • 'hamming'

    The distance between two variables is the Hamming distance between the strings that represent them. The strings must have equal length.
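
The two categorical distance options can be illustrated as follows. The function names are hypothetical, and the last helper, which sums per-column distances scaled by CategoryWeights, is an assumption about how the scaling might be combined across columns; this page does not specify the exact formula.

```python
def overlap_distance(a, b):
    """Return 0 if the two values are the same, 1 if they differ."""
    return 0 if a == b else 1


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    sa, sb = str(a), str(b)
    if len(sa) != len(sb):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(sa, sb))


def weighted_categorical_distance(row_a, row_b, category_weights, distance):
    """Scale each per-column distance by its weight and add them up.

    Assumption: summation is used to combine columns; only the scaling
    by weight is stated in the argument description.
    """
    return sum(
        w * distance(a, b)
        for w, a, b in zip(category_weights, row_a, row_b)
    )
```

For example, hamming_distance("karolin", "kathrin") counts three differing positions, while overlap_distance reports only that the two strings differ.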

Seed Optional Specifies the random seed with which to initialize the algorithm (a LONG value). If you specify Seed, you must also specify SeedColumn.
SeedColumn Optional Specifies the names of the input_table columns by which to partition the input. Function calls that use the same input data, seed, and seed_column output the same result. If you specify SeedColumn, you must also specify Seed.
Ideally, the number of distinct values in the seed_column is the same as the number of workers in the cluster. A very large number of distinct values in the seed_column degrades function performance.
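
The following sketch illustrates why partitioning the input by a seed column can make results reproducible. It is an assumption about the general technique, not a description of the function's internals: rows are grouped by the seed column value, and each group uses its own random generator derived from the seed and the group key, so the outcome does not depend on the order in which rows arrive. The name seeded_partition_sample and the per_partition parameter are hypothetical.

```python
import random
from collections import defaultdict


def seeded_partition_sample(rows, seed_column, seed, per_partition):
    """Sample up to per_partition rows from each seed-column partition.

    rows        -- list of tuples (hypothetical in-memory table)
    seed_column -- index of the column used to partition the rows
    seed        -- user-supplied seed (the Seed argument)
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[seed_column]].append(row)

    sample = []
    for key in sorted(partitions):
        # Sort within the partition so input order does not matter.
        part = sorted(partitions[key])
        # One generator per partition, derived from the seed and the key.
        rng = random.Random(f"{seed}:{key}")
        sample.extend(rng.sample(part, min(per_partition, len(part))))
    return sample
```

With few distinct key values, each partition holds many rows and the per-partition work parallelizes well; with a very large number of distinct values, the per-partition overhead dominates, which is consistent with the performance note above.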
OverSamplingRate Optional For KMeans|| sampling, specifies the oversampling rate (a DOUBLE PRECISION value greater than 0.0). For each sample_size, the function multiplies sample_size by rate. The default rate is 1.0.
IterationNum Optional For KMeans|| sampling, specifies the number of iterations (an INTEGER value greater than 0). The default number_of_iterations is 5.
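
To show how OverSamplingRate and IterationNum might interact, the sketch below implements only the oversampling phase of a kmeans||-style procedure. It is an illustration under stated assumptions, not the Aster implementation: the function name, the Euclidean distance, and the rule that each row is picked independently with probability min(1, rate * sample_size * score / total) are assumptions modeled on the general idea in the Bahmani et al. paper. The final reweighting and clustering step that reduces the candidates to exactly k rows is omitted.

```python
import math
import random


def kmeans_parallel_candidates(rows, weights, sample_size,
                               rate=1.0, iterations=5, seed=None):
    """Return indices of candidate rows after the oversampling rounds."""
    rng = random.Random(seed)
    n = len(rows)
    # Expected number of rows added per iteration: rate * sample_size.
    per_round = rate * sample_size

    # Start from one row chosen uniformly at random.
    candidates = [rng.randrange(n)]
    nearest = [math.dist(rows[i], rows[candidates[0]]) for i in range(n)]

    for _ in range(iterations):
        scores = [weights[i] * nearest[i] for i in range(n)]
        total = sum(scores)
        if total == 0:
            break  # every row coincides with a candidate already
        # Each row is picked independently; several rows can be added
        # in the same round, which is the oversampling step.
        picked = [
            i for i in range(n)
            if rng.random() < min(1.0, per_round * scores[i] / total)
        ]
        for p in picked:
            candidates.append(p)
            for i in range(n):
                nearest[i] = min(nearest[i], math.dist(rows[i], rows[p]))
    return candidates
```

Under these assumptions, the expected number of candidates grows with rate * sample_size per iteration, so a higher rate or more iterations yields a larger candidate set to recluster.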