- Specifies the name of the table that contains the data set from which to take samples.
- Specifies both the number of samples and their sizes. For each sample_size (an INTEGER value), the function selects a sample that has sample_size rows.
- [Optional] Specifies the name of the input_table column that contains weights for weighted sampling. The weight_column must have a numeric SQL data type. Default behavior: Rows have equal weight.
- [Optional] Specifies the sampling mode:
- 'basic' (Default): Each input_table row has a probability of being selected that is proportional to its weight. The weight of each row is in weight_column.
- 'kmeans++': One row is selected in each of k iterations, where k is the number of desired output rows. The first row is selected randomly. In subsequent iterations, the probability of a row being selected is proportional to the value in the WeightColumn multiplied by the distance from the nearest row in the set of selected rows. The distance is calculated using the methods specified by the Distance and CategoricalDistance arguments.
- 'kmeans||': Enhanced version of KMeans++ that exploits parallel architecture to accelerate the sampling process. The algorithm is described in the paper Scalable K-Means++ by Bahmani et al. (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). Briefly, at each iteration, the probability that a row is selected is proportional to the value in the WeightColumn multiplied by the distance from the nearest row in the set of selected rows (as in KMeans++). However, the KMeans|| algorithm oversamples at each iteration, significantly reducing the required number of iterations; therefore, the resulting set of rows might have more than k data points. Each row in the resulting set is then weighted by the number of rows in the table that are closer to that row than to any other selected row, and the rows are clustered to produce exactly k rows.
Tip: For optimal performance, use 'kmeans++' when the desired sample size is less than 15 and 'kmeans||' otherwise.
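The 'basic' and KMeans++ selection rules described for the sampling mode can be sketched in Python as follows. This is only an illustration under stated assumptions, not the function's implementation: the names `basic_sample` and `kmeanspp_sample` are hypothetical, rows are assumed to be purely numeric tuples compared with Euclidean distance, and 'basic' sampling is assumed to draw without replacement (the text above does not say either way).

```python
import math
import random

def basic_sample(rows, weights, k, rng):
    """Pick k distinct rows; each draw is proportional to the row's weight.

    Assumption: sampling is without replacement.
    """
    remaining = list(range(len(rows)))
    chosen = []
    for _ in range(k):
        w = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=w, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return [rows[i] for i in chosen]

def kmeanspp_sample(rows, weights, k, rng):
    """Pick one row per iteration, favoring rows far from those already picked."""
    n = len(rows)
    first = rng.randrange(n)  # the first row is selected randomly
    chosen = [first]
    # nearest[i] = distance from row i to the closest selected row
    nearest = [math.dist(rows[i], rows[first]) for i in range(n)]
    while len(chosen) < k:
        # selection probability is proportional to weight * distance
        scores = [weights[i] * nearest[i] for i in range(n)]
        if sum(scores) == 0:
            break  # every remaining row coincides with a selected row
        pick = rng.choices(range(n), weights=scores, k=1)[0]
        chosen.append(pick)
        nearest = [min(nearest[i], math.dist(rows[i], rows[pick])) for i in range(n)]
    return [rows[i] for i in chosen]

# Usage: two tight pairs of points and one isolated point.
rows = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1), (20.0, 0.0)]
weights = [1.0] * len(rows)
print(basic_sample(rows, weights, 3, random.Random(1)))
print(kmeanspp_sample(rows, weights, 3, random.Random(1)))
```

Because a selected row has distance 0 to itself, its score is 0 and it cannot be picked again, so the KMeans++ sketch needs no separate bookkeeping to keep the rows distinct.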
- [Optional] For KMeans++ and KMeans|| sampling, specifies the function for computing the distance between numerical variables:
- 'euclidean' (Default): The distance between two variables is their Euclidean distance (see Euclidean Distance in the Background section of the VectorDistance function).
- 'manhattan': The distance between two variables is their Manhattan distance (see Manhattan Distance in the Background section of the VectorDistance function).
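For reference, the two numerical distance options correspond to the standard Euclidean and Manhattan formulas. A minimal Python sketch, with hypothetical helper names `euclidean` and `manhattan` and rows represented as equal-length numeric sequences:

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```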
- [Optional] Required for KMeans++ and KMeans|| sampling.
Specifies the names of the input_table columns to use to calculate the distance between numerical variables.
- [Optional] For KMeans++ and KMeans|| sampling, specifies the names of the input_table columns that contain numerical variables to treat as categorical variables. Default behavior: No numerical variables are treated as categorical variables.
- [Optional] For KMeans++ and KMeans|| sampling, specifies the weights (DOUBLE PRECISION values) of the categorical variables, including those that the AsCategories argument specifies. Specify the weights in the order (from left to right) that the variables appear in the input table. When calculating the distance between two rows, distances between categorical values are scaled by these weights. Default behavior: All categories have equal weight.
- [Optional] For KMeans++ and KMeans|| sampling, specifies the function for computing the distance between categorical variables:
- 'overlap' (Default): The distance between two variables is 0 if they are the same and 1 if they are different.
- 'hamming': The distance between two variables is the Hamming distance between the strings that represent them. The strings must have equal length.
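The two categorical distance options, and the scaling by category weights, can be illustrated with a short Python sketch. The helper names are hypothetical. The sketch returns one scaled distance per categorical column, because the text does not state how these values are combined with each other or with the numerical distance.

```python
def overlap(a, b):
    """0 if the two values are the same, 1 if they differ."""
    return 0 if a == b else 1

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

def scaled_categorical_distances(row_a, row_b, category_weights, metric=overlap):
    """Per-column categorical distances, each scaled by that column's weight."""
    return [w * metric(str(x), str(y))
            for x, y, w in zip(row_a, row_b, category_weights)]

print(overlap("red", "blue"))      # 1
print(hamming("10110", "10011"))   # 2
print(scaled_categorical_distances(("red", "S"), ("blue", "S"), [2.0, 0.5]))  # [2.0, 0.0]
```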
- [Optional] If you specify Seed, you must also specify SeedColumn.
Specifies the random seed with which to initialize the algorithm (a LONG value).
- [Optional] If you specify SeedColumn, you must also specify Seed.
Specifies the names of the input_table columns by which to partition the input. Function calls that use the same input data, seed, and seed_column output the same result. Ideally, the number of distinct values in the seed_column equals the number of workers in the cluster; a large number of distinct values in the seed_column degrades function performance.
- [Optional] For KMeans|| sampling, specifies the oversampling rate (a positive DOUBLE PRECISION value). The function multiplies rate by sample_size (for each sample_size). Default: 1.0.
- [Optional] For KMeans|| sampling, specifies the number of iterations (a positive INTEGER value). Default: 5.
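To show how the oversampling rate and the iteration count interact in KMeans|| sampling, here is a simplified single-process Python sketch of the oversampling phase. It is an illustration under assumptions, not the function's implementation: the name `kmeans_parallel_candidates` is hypothetical, rows are numeric tuples compared with Euclidean distance, and each row is assumed to be picked independently with probability min(1, rate * sample_size * score / total score) per iteration. The final step that clusters the candidates down to exactly sample_size rows is omitted.

```python
import math
import random

def kmeans_parallel_candidates(rows, weights, sample_size, rate=1.0, iterations=5, seed=0):
    """Return {candidate row index: number of input rows nearest to it}."""
    rng = random.Random(seed)
    oversample = rate * sample_size        # rate multiplied by sample_size
    selected = {rng.randrange(len(rows))}  # start from one random row
    for _ in range(iterations):
        # score = weight * distance to the nearest already-selected row
        scores = [w * min(math.dist(r, rows[j]) for j in selected)
                  for r, w in zip(rows, weights)]
        total = sum(scores)
        if total == 0:
            break  # every row coincides with a selected row
        # Oversampling: several rows may be added in a single iteration.
        for i, s in enumerate(scores):
            if rng.random() < min(1.0, oversample * s / total):
                selected.add(i)
    candidates = sorted(selected)
    # Weight each candidate by the number of rows that are closest to it.
    counts = {j: 0 for j in candidates}
    for r in rows:
        nearest = min(candidates, key=lambda j: math.dist(r, rows[j]))
        counts[nearest] += 1
    return counts

# Usage: 200 random points, asking for candidates for a sample of size 5.
gen = random.Random(42)
rows = [(gen.random(), gen.random()) for _ in range(200)]
counts = kmeans_parallel_candidates(rows, [1.0] * len(rows), sample_size=5)
print(len(counts), "candidates; weights sum to", sum(counts.values()))
```

The candidate set can contain more than sample_size rows, which is why the description of KMeans|| includes a final clustering step that reduces it to exactly k rows.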