Argument | Category | Description |
---|---|---|
InputTable | Required | Specifies the name of the table that contains the data set from which to take samples. |
NumSample | Required | Specifies both the number of samples and their sizes. For each sample_size (an INTEGER value), the function selects a sample that has sample_size rows. |
WeightColumn | Optional | Specifies the name of the input_table column that contains weights for weighted sampling. The weight_column must have a numeric SQL data type. By default, rows have equal weight. |
SamplingMode | Optional | Specifies the sampling mode. Tip: For optimal performance, use 'kmeans++' when the desired sample size is less than 15 and 'kmeans||' otherwise. |
Distance | Optional | For KMeans++ and KMeans|| sampling, specifies the function for computing the distance between numerical variables. |
InputColumns | Optional | For KMeans++ and KMeans|| sampling, specifies the names of the input_table columns to use to calculate the distance between numerical variables. |
AsCategories | Optional | For KMeans++ and KMeans|| sampling, specifies the names of the input_table columns that contain numerical variables to treat as categorical variables. |
CategoryWeights | Optional | For KMeans++ and KMeans|| sampling, specifies the weights (DOUBLE PRECISION values) of the categorical variables, including those that the AsCategories argument specifies. Specify the weights in the order (from left to right) that the variables appear in the input table. When calculating the distance between two rows, distances between categorical values are scaled by these weights. |
CategoricalDistance | Optional | For KMeans++ and KMeans|| sampling, specifies the function for computing the distance between categorical variables: 'overlap' (default): The distance between two variables is 0 if they are the same and 1 if they are different. 'hamming': The distance between two variables is the Hamming distance between the strings that represent them. The strings must have equal length. |
Seed | Optional | Specifies the random seed with which to initialize the algorithm (a LONG value). If you specify Seed, you must also specify SeedColumn. |
SeedColumn | Optional | Specifies the names of the input_table columns by which to partition the input. Function calls that use the same input data, seed, and seed_column output the same result. If you specify SeedColumn, you must also specify Seed. Ideally, the number of distinct values in the seed_column is the same as the number of workers in the cluster. A very large number of distinct values in the seed_column degrades function performance. |
OverSamplingRate | Optional | For KMeans|| sampling, specifies the oversampling rate (a DOUBLE PRECISION value greater than 0.0). The function multiplies rate by sample_size (for each sample_size). The default rate is 1.0. |
IterationNum | Optional | For KMeans|| sampling, specifies the number of iterations (an INTEGER value greater than 0). The default number_of_iterations is 5. |
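
The distance and seeding arguments above can be illustrated with a small Python sketch. It is not the function's implementation: the helper names (`overlap_distance`, `hamming_distance`, `categorical_distance`, `kmeans_pp_sample`) are hypothetical, summing the weighted per-column distances is an assumption about how CategoryWeights scales them, and the sampler uses the textbook k-means++ rule (selection probability proportional to weight times squared distance to the nearest row already chosen), which may differ from what the function does internally.

```python
import math
import random


def overlap_distance(a, b):
    """'overlap': 0 if the two values are the same, 1 if they differ."""
    return 0 if a == b else 1


def hamming_distance(a, b):
    """'hamming': number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def categorical_distance(row_a, row_b, weights, metric=overlap_distance):
    """Scale each per-column categorical distance by its weight and sum them.

    Assumption: the scaled distances are combined by summation.
    """
    return sum(w * metric(str(x), str(y)) for x, y, w in zip(row_a, row_b, weights))


def kmeans_pp_sample(rows, sample_size, weights=None, seed=None, distance=math.dist):
    """Textbook k-means++ style sampling over numeric rows (illustrative only).

    The first row is drawn with probability proportional to its weight; each
    later row is drawn with probability proportional to weight * d^2, where d
    is the distance to the nearest row already selected.
    """
    rng = random.Random(seed)  # same seed and same input give the same sample
    if weights is None:
        weights = [1.0] * len(rows)  # by default, rows have equal weight
    remaining = list(range(len(rows)))

    first = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
    chosen = [first]
    remaining.remove(first)
    min_d2 = {i: distance(rows[i], rows[first]) ** 2 for i in remaining}

    while len(chosen) < sample_size and remaining:
        probs = [weights[i] * min_d2[i] for i in remaining]
        if sum(probs) == 0:
            # All remaining rows coincide with selected ones; fall back to weights.
            probs = [weights[i] for i in remaining]
        nxt = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            min_d2[i] = min(min_d2[i], distance(rows[i], rows[nxt]) ** 2)

    return [rows[i] for i in chosen]
```

For example, `hamming_distance("10110", "10011")` returns 2, and `categorical_distance(("red", "S"), ("blue", "S"), [2.0, 0.5])` returns 2.0 because only the first column differs. Calling `kmeans_pp_sample` twice with the same rows and the same `seed` yields the same rows, which mirrors the reproducibility that Seed and SeedColumn are described as providing.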