Methods defined here:
- __init__(self, data=None, num_sample=None, weight_column=None, sampling_mode='Basic', distance='EUCLIDEAN', input_columns=None, as_categories=None, category_weights=None, categorical_distance='OVERLAP', seed=None, seed_column=None, over_sampling_rate=1.0, iteration_num=5, setid_as_first_column=True, data_sequence_column=None)
- DESCRIPTION:
The RandomSample function takes a data set and uses a specified
sampling method to output one or more random samples. Each sample has
exactly the number of rows specified.
PARAMETERS:
data:
Required Argument.
Specifies the name of the teradataml DataFrame that contains the data
set from which to take samples.
num_sample:
Required Argument.
Specifies both the number of samples and their sizes. For each
value in num_sample (an int), the function selects a sample that
has that many rows.
Types: int OR list of Integers (int)
weight_column:
Optional Argument.
Specifies the name of the teradataml DataFrame column that
contains weights for weighted sampling. The weight_column must
have a numeric SQL data type. By default, rows have equal weight.
Types: str
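The interaction of num_sample and weight_column can be illustrated with a
minimal pure-Python sketch. It is not the function's implementation: the
helper name weighted_samples, the 0-based set_id, and the choice to sample
without replacement within each sample are assumptions made for illustration.

```python
import random

def weighted_samples(rows, weights, num_sample, seed=None):
    """Draw one weighted sample per requested size (illustrative only)."""
    # Accept a single int or a list of ints, as num_sample does.
    sizes = [num_sample] if isinstance(num_sample, int) else list(num_sample)
    rng = random.Random(seed)
    out = []
    for set_id, size in enumerate(sizes):
        # Each sample starts again from the full data set.
        pool = list(zip(rows, weights))
        for _ in range(size):
            # Selection probability is proportional to the row's weight.
            idx = rng.choices(range(len(pool)), weights=[w for _, w in pool])[0]
            row, _ = pool.pop(idx)  # assumed: no row repeats within a sample
            out.append((set_id, row))
    return out

# Two samples: one of 2 rows and one of 1 row.
samples = weighted_samples(["a", "b", "c", "d"], [1.0, 2.0, 3.0, 4.0], [2, 1], seed=1)
```

Each tuple pairs a hypothetical sample identifier with the selected row, so a
list such as [2, 1] yields three tuples in total.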
sampling_mode:
Optional Argument.
Specifies the sampling mode and can be one of the following:
• "Basic": Each row of the input teradataml DataFrame has a
probability of being selected that is proportional to its
weight. The weight of each row is in weight_column.
• "KMeans++": One row is selected in each of k iterations,
where k is the number of desired output rows. The first
row is selected randomly. In subsequent iterations, the
probability of a row being selected is proportional to the
value in the weight_column multiplied by the distance from
the nearest row in the set of selected rows. The distance
is calculated using the methods specified by the distance
and categorical_distance arguments.
• "KMeans||": Enhanced version of KMeans++ that exploits
parallel architecture to accelerate the sampling process.
The algorithm is described in the paper "Scalable K-Means++"
by Bahmani et al. (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
Briefly, at each iteration, the probability that a row is
selected is proportional to the value in the weight_column
multiplied by the distance from the nearest row in the set of
selected rows (as in KMeans++). However, the KMeans|| algorithm
oversamples at each iteration, significantly reducing the
required number of iterations; therefore, the resulting set of
rows might have more than k data points. Each row in the
resulting set is then weighted by the number of rows in the
teradataml DataFrame that are closer to that row than to any
other selected row, and the rows are clustered to produce
exactly k rows.
Tip: For optimal performance, use "KMeans++" when the
desired sample size is less than 15 and "KMeans||" otherwise.
Default Value: "Basic"
Permitted Values: Basic, KMeans++, KMeans||
Types: str
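The "KMeans++" selection rule described above can be sketched in pure Python.
This is an illustration under stated assumptions, not the function's
implementation: the helper name kmeanspp_sample is hypothetical, only numeric
columns are considered, and Euclidean distance is used.

```python
import math
import random

def kmeanspp_sample(points, weights, k, seed=None):
    """Pick k row indices KMeans++-style (illustrative only)."""
    rng = random.Random(seed)
    # The first row is selected uniformly at random.
    chosen = [rng.randrange(len(points))]
    while len(chosen) < k:
        scores = []
        for i, p in enumerate(points):
            # Distance from this row to the nearest already-selected row.
            nearest = min(math.dist(p, points[c]) for c in chosen)
            # Selection probability is proportional to weight * distance.
            scores.append(weights[i] * nearest)
        chosen.append(rng.choices(range(len(points)), weights=scores)[0])
    return chosen

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
idx = kmeanspp_sample(pts, [1.0] * 5, 3, seed=7)
```

A row that is already selected has distance 0 to itself, so its score is 0 and
it cannot be drawn again; the sketch assumes the input points are distinct.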
distance:
Optional Argument.
For KMeans++ and KMeans|| sampling, specifies the function for
computing the distance between numerical variables:
• 'EUCLIDEAN' : The distance between two variables is defined
using Euclidean Distance.
• 'MANHATTAN': The distance between two variables is defined
using Manhattan Distance.
Default Value: "EUCLIDEAN"
Permitted Values: MANHATTAN, EUCLIDEAN
Types: str
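For reference, the two numeric distance metrics can be written as follows.
The function names euclidean and manhattan are illustrative helpers, not part
of teradataml.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))  # 7.0
```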
input_columns:
Optional Argument.
For KMeans++ and KMeans|| sampling, specifies the names of the
teradataml DataFrame columns to calculate the distance between
numerical variables.
Types: str OR list of Strings (str)
as_categories:
Optional Argument.
For KMeans++ and KMeans|| sampling, specifies the names of the
teradataml DataFrame columns that contain numerical variables
to treat as categorical variables.
Types: str OR list of Strings (str)
category_weights:
Optional Argument.
For KMeans++ and KMeans|| sampling, specifies the weights
(float values) of the categorical variables, including those
that 'as_categories' argument specifies. Specify the weights in
the order (from left to right) that the variables appear in the
input teradataml Dataframe. When calculating the distance between
two rows, distances between categorical values are scaled by
these weights.
Types: float or list of Floats (float).
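A minimal sketch of how category weights might scale categorical distances is
shown below. How the function actually combines numeric and categorical
distances is not stated here, so the simple sum, the helper name row_distance,
and the sample values are all assumptions made for illustration.

```python
def row_distance(num_a, num_b, cat_a, cat_b, category_weights):
    """Distance between two rows with weighted categorical terms (illustrative)."""
    # Manhattan distance over the numeric columns.
    numeric = sum(abs(x - y) for x, y in zip(num_a, num_b))
    # OVERLAP distance (0 if equal, 1 otherwise), scaled by each weight.
    categorical = sum(
        w * (0.0 if x == y else 1.0)
        for w, x, y in zip(category_weights, cat_a, cat_b)
    )
    # Assumed combination rule: add the two parts.
    return numeric + categorical

# Two made-up rows: two numeric columns and two categorical columns each.
d = row_distance((21.0, 110.0), (22.8, 93.0), (6, 4), (4, 4), [1000.0, 10.0])
```

Here only the first categorical column differs, so its weight of 1000.0
dominates the result; a larger weight makes a mismatch in that column count
for more.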
categorical_distance:
Optional Argument.
For KMeans++ and KMeans|| sampling, specifies the function for
computing the distance between categorical variables:
• "OVERLAP" : The distance between two variables is 0 if
they are the same and 1 if they are different.
• "HAMMING": The distance between two variables is the Hamming
distance between the strings that represent them. The
strings must have equal length.
Default Value: "OVERLAP"
Permitted Values: OVERLAP, HAMMING
Types: str
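The two categorical distance metrics can be sketched as follows. The function
names overlap and hamming are illustrative helpers, not part of teradataml.

```python
def overlap(a, b):
    # 0 if the two values are the same, 1 if they are different.
    return 0 if a == b else 1

def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(overlap("4", "6"))              # 1
print(hamming("karolin", "kathrin"))  # 3
```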
seed:
Optional Argument.
Specifies the random seed used to initialize the algorithm.
Types: int
seed_column:
Optional Argument.
Specifies the names of the teradataml DataFrame columns by
which to partition the input. Function calls that use the same
input data, seed, and seed_column output the same result. If
you specify seed_column, you must also specify seed.
Note: Ideally, the number of distinct values in the seed_column
is the same as the number of workers in the cluster. A very
large number of distinct values in the seed_column degrades
function performance.
Types: str OR list of Strings (str)
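The reproducibility guarantee can be illustrated with a small pure-Python
sketch. It is an assumption-laden illustration, not the function's
implementation: the helper name partitioned_sample, the per-partition sample
size, and the way a generator is seeded for each partition are hypothetical.

```python
import random
from collections import defaultdict

def partitioned_sample(rows, key, per_partition, seed):
    """Sample within partitions using a seeded generator (illustrative only)."""
    # Partition the input by the value of the seed column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    picked = []
    # Visit partitions in a fixed order so the output order is stable.
    for part_key in sorted(partitions):
        # Assumed scheme: one generator per partition, derived from the seed.
        rng = random.Random(f"{seed}:{part_key}")
        part = partitions[part_key]
        picked.extend(rng.sample(part, min(per_partition, len(part))))
    return picked

rows = [{"model": m, "wt": w} for m, w in
        [("a", 2.6), ("a", 2.9), ("b", 3.2), ("b", 3.4), ("b", 3.6)]]
first = partitioned_sample(rows, "model", 1, seed=1)
second = partitioned_sample(rows, "model", 1, seed=1)
```

With the same input data, seed, and partitioning column, first and second are
identical, which mirrors the behavior described for seed and seed_column.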
over_sampling_rate:
Optional Argument.
For KMeans|| sampling, specifies the oversampling rate (a float
value greater than 0.0). The function multiplies rate by
sample size (for each sample size).
Default Value: 1.0
Types: float
iteration_num:
Optional Argument.
For KMeans|| sampling, specifies the number of iterations (an
int value greater than 0).
Default Value: 5
Types: int
setid_as_first_column:
Optional Argument.
Specifies whether to include the generated set_id values as the first
column in the output.
Note: "setid_as_first_column" argument support is only available
when teradataml is connected to Vantage 1.1 or later.
Default Value: True
Types: bool
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identify each
row of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that
vary from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of RandomSample.
Output teradataml DataFrames can be accessed using attribute
references, such as RandomSampleObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("randomsample", ["fs_input", "fs_input1"])
# Create teradataml DataFrame objects. The input tables have
# observations of 11 variables for different models of cars.
fs_input = DataFrame.from_table("fs_input")
fs_input1 = DataFrame.from_table("fs_input1")
# Example 1 - Basic Sampling (Weighted).
# This example uses basic sampling to select one sample of 10 rows,
# which are weighted by car weight.
RandomSample_out1 = RandomSample(data = fs_input,
num_sample = 10,
weight_column = "wt",
sampling_mode = "Basic",
seed = 1,
seed_column = ["model"])
# Print the result DataFrame
print(RandomSample_out1.result)
# Example 2 - KMeans++ Sampling.
# This example uses KMeans++ sampling with the Manhattan
# distance metric, and treats the numeric variables cyl,
# gear, and carb as categorical variables.
RandomSample_out2 = RandomSample(data = fs_input,
num_sample = 10,
sampling_mode = "KMeans++",
distance = "MANHATTAN",
input_columns = ['mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'],
as_categories = ["cyl","gear","carb"],
category_weights = [1000.0,10.0,100.0,100.0,100.0],
seed = 1,
seed_column = ["model"]
)
# Print the result DataFrame
print(RandomSample_out2.result)
# Example 3 - KMeans|| Sampling.
# This example uses KMeans|| sampling with the Manhattan
# distance metric for the numerical variables and the Hamming
# distance metric for the categorical variables.
RandomSample_out3 = RandomSample(data = fs_input1,
num_sample = 20,
sampling_mode = "KMeans||",
distance = "MANHATTAN",
input_columns = ['mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb'],
as_categories = ["cyl","gear","carb"],
category_weights = [1000.0,10.0,100.0,100.0,100.0],
categorical_distance = "HAMMING",
seed = 1,
seed_column = ["model"],
iteration_num = 2
)
# Print the result DataFrame
print(RandomSample_out3.result)
# Example 4 - Basic Sampling (Multiple Samples).
# This example uses basic sampling to select three samples of
# 2, 3, and 1 rows, weighted by car weight.
RandomSample_out4 = RandomSample(data = fs_input,
num_sample = [2,3,1],
weight_column = "wt"
)
# Print the result DataFrame
print(RandomSample_out4.result)
- __repr__(self)
- Returns the string representation for a RandomSample class instance.
- get_build_time(self)
- Returns the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.