Teradata® Package for Python Function Reference on VantageCloud Lake
- Deployment: VantageCloud
- Edition: Lake
- Product: Teradata Package for Python
- Release Number: 20.00.00.03
- Published: December 2024
- Last Edition: 2024-12-19
- Product Category: Teradata Vantage
- teradataml.dataframe.dataframe.DataFrame.sample = sample(self, n=None, frac=None, replace=False, randomize=False, case_when_then=None, case_else=None, stratify_column=None, seed=None, id_column=None)
- DESCRIPTION:
Samples rows from a teradataml DataFrame, either directly or based on conditions.
Adds a new column 'sampleid' that holds a unique ID for each sample,
so the rows belonging to each sample can be identified.
PARAMETERS:
n:
Required Argument, if neither 'frac' nor 'case_when_then' is specified.
Specifies one or more positive integer constants, each giving the number of
rows to be sampled from the teradataml DataFrame.
Example:
n = 10 or n = [10] or n = [10, 20, 30, 40]
Default Value: None
Types: int or list of ints.
Note:
1. You should use only one of the following arguments: 'n', 'frac' and 'case_when_then'.
2. No more than 16 samples can be requested per count description.
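The constraints on 'n' described above can be pictured with a small plain-Python pre-check. This is only a sketch: `pick_sampling_mode` and `normalize_counts` are hypothetical helper names, not part of teradataml, and the library's own validation and error types may differ.

```python
def pick_sampling_mode(n=None, frac=None, case_when_then=None):
    """Return the name of the single sampling argument that was supplied."""
    supplied = [name for name, value in
                (("n", n), ("frac", frac), ("case_when_then", case_when_then))
                if value is not None]
    if len(supplied) != 1:
        raise ValueError(
            "Specify exactly one of 'n', 'frac', 'case_when_then'; got: %s" % supplied)
    return supplied[0]


def normalize_counts(n):
    """Turn n into a list of positive ints holding at most 16 entries."""
    counts = [n] if isinstance(n, int) else list(n)
    if not 1 <= len(counts) <= 16:
        raise ValueError("'n' must describe between 1 and 16 samples.")
    for count in counts:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(count, bool) or not isinstance(count, int) or count <= 0:
            raise ValueError("Every element of 'n' must be a positive integer.")
    return counts
```

With these helpers, `normalize_counts(10)` and `normalize_counts([10])` both yield a one-element list, mirroring the equivalent forms `n = 10` and `n = [10]` shown in the example above.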
frac:
Required Argument, if neither 'n' nor 'case_when_then' is specified.
Specifies one or more unsigned floating point constants in the half-open
interval (0, 1], that is, greater than 0 and less than or equal to 1.
Each value specifies the fraction of rows to be sampled from the teradataml DataFrame.
Example:
frac = 0.4 or frac = [0.4] or frac = [0.2, 0.5]
Default Value: None
Types: float or list of floats.
Note:
1. You should use only one of the following arguments: 'n', 'frac' and 'case_when_then'.
2. No more than 16 samples can be requested per fraction description.
3. The sum of the elements in the list must be greater than 0 and must not exceed 1,
because the total percentage cannot be more than 100%.
4. Stratifying data sample is supported only when "stratify_column"
is used with "frac" argument.
5. When data is sampled with stratification, the list must contain at least one
and at most two float elements. The first element is the train data sample
percentage and the second element is the test data sample percentage.
6. When "frac" has only one fraction for data sampling with stratification,
the remaining fraction is considered for sampling the data.
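The notes on 'frac' amount to a handful of checks that can be sketched in plain Python. `normalize_fractions` is a hypothetical helper written for illustration only; it is not a teradataml function, and the tolerance used when comparing the sum with 1 is an assumption made to absorb floating point rounding.

```python
import math


def normalize_fractions(frac, stratified=False):
    """Turn frac into a validated list of floats in the interval (0, 1]."""
    fractions = [frac] if isinstance(frac, (int, float)) else list(frac)
    # Stratified sampling accepts one or two fractions; otherwise up to 16.
    max_len = 2 if stratified else 16
    if not 1 <= len(fractions) <= max_len:
        raise ValueError("'frac' must hold between 1 and %d values." % max_len)
    if any(not 0 < value <= 1 for value in fractions):
        raise ValueError("Every fraction must be in the interval (0, 1].")
    total = sum(fractions)
    # Allow a sum that equals 1 up to rounding error, reject anything larger.
    if total > 1 and not math.isclose(total, 1.0):
        raise ValueError("The fractions must not sum to more than 1.")
    return fractions
```

For example, `[0.2, 0.5]` passes because the fractions sum to 0.7, while `[0.6, 0.5]` is rejected because together they would request more than 100% of the rows.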
replace:
Optional Argument.
Specifies if sampling should be done with replacement or not.
Default Value: False
Types: bool
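As a conceptual analogy only, the difference between sampling without and with replacement can be shown with Python's standard `random` module. This runs locally on a plain list and says nothing about how Vantage executes the sample; the row ids and seed are made up for the illustration.

```python
import random

rng = random.Random(0)          # fixed seed so the illustration is repeatable
row_ids = list(range(1, 11))    # stand-in for 10 rows

# Without replacement: each row can be picked at most once.
without_replacement = rng.sample(row_ids, k=5)

# With replacement: the same row may be picked several times, so more
# picks than rows are possible and duplicates are guaranteed here.
with_replacement = rng.choices(row_ids, k=20)

print(sorted(without_replacement))
print(sorted(with_replacement))
```

This matches the behaviour visible in the examples further down, where the same row appears several times in the output when replace is True.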
randomize:
Optional Argument.
Specifies if sampling should be done across AMPs in Teradata or per AMP.
Default Value: False
Types: bool
case_when_then:
Required Argument, if neither 'frac' nor 'n' is specified.
Specifies conditions and the corresponding sample sizes as key-value pairs.
Keys must be of type ColumnExpression.
Values must be of type int, float, list of ints or list of floats.
The following usage of key is not allowed:
case_when_then = {"gpa" > 2 : 2}
The following operators are supported:
comparison: ==, !=, <, <=, >, >=
boolean: & (and), | (or), ~ (not), ^ (xor)
Example :
case_when_then = {df.gpa > 2 : 2}
case_when_then = {(df.gpa > 2) & (df.stats == 'Novice') : [0.2, 0.3],
df.programming == 'Advanced' : [10,20,30]}
Default Value: None
Types: dictionary
Note:
1. You should use only one of the following arguments: 'n', 'frac' and 'case_when_then'.
2. No more than 16 samples can be requested per fraction description or count description.
3. If any value in the dictionary is specified as a list of floats, then the
sum of the elements in that list must be greater than 0 and must not exceed 1,
because the total percentage cannot be more than 100%.
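When conditions are combined with `&`, `|` or `^`, note that in Python these operators bind more tightly than the comparison operators, so each comparison needs its own parentheses. The sketch below uses only the standard `ast` module to show how Python parses the two spellings; `describe` is a hypothetical helper, and nothing here touches teradataml itself.

```python
import ast


def describe(expression):
    """Report the top-level node Python builds for an expression string."""
    node = ast.parse(expression, mode="eval").body
    if isinstance(node, ast.Compare):
        # A chained comparison such as a > b == c has two operators.
        return ("Compare", len(node.ops))
    if isinstance(node, ast.BinOp):
        return ("BinOp", type(node.op).__name__)
    return (type(node).__name__, None)


# Parsed as the chained comparison  df.gpa > (2 & df.stats) == 'Novice'.
unparenthesized = describe("df.gpa > 2 & df.stats == 'Novice'")

# Parsed as the intended AND of two separate comparisons.
parenthesized = describe("(df.gpa > 2) & (df.stats == 'Novice')")

print(unparenthesized)
print(parenthesized)
```

Only the parenthesized form produces a single `&` node whose operands are the two comparisons, which is the shape a combined condition is meant to have.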
case_else :
Optional Argument.
Specifies number of samples to be sampled from rows where none of the conditions in
'case_when_then' are met.
Example :
case_else = 10
case_else = [10,20]
case_else = [0.5]
case_else = [0.2,0.4]
Default Value: None
Types: int or float or list of ints or list of floats
Note:
1. This argument can only be used with 'case_when_then'.
If used otherwise, the following error is raised:
'case_else' can only be used when 'case_when_then' is specified.
2. No more than 16 samples can be requested per fraction description
or count description.
3. If case_else is a list of floats, then the sum of the elements in the list
must be greater than 0 and must not exceed 1, because the total percentage
cannot be more than 100%.
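Judging from the example outputs in this reference, 'sampleid' values appear to be numbered consecutively across the 'case_when_then' entries in the order they are written, with the 'case_else' samples numbered last. The sketch below reproduces that numbering in plain Python. `sample_id_ranges` is a hypothetical helper, and the numbering rule is an inference from the examples rather than a documented guarantee.

```python
def sample_id_ranges(case_values, case_else=None):
    """Assign consecutive sample ids to each size specification in order."""
    specs = list(case_values)
    if case_else is not None:
        specs.append(case_else)      # case_else samples are numbered last
    ranges = []
    next_id = 1
    for spec in specs:
        # A list requests one sample per element; a scalar requests one sample.
        size = len(spec) if isinstance(spec, (list, tuple)) else 1
        ranges.append(list(range(next_id, next_id + size)))
        next_id += size
    return ranges


# Sizes taken from the case_when_then / case_else example in this reference:
# [1, 3], [1, 2], 5, 5 and case_else = 1.
print(sample_id_ranges([[1, 3], [1, 2], 5, 5], case_else=1))
```

Under this assumption the four conditions and the 'case_else' entry account for seven sample ids in total.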
stratify_column:
Optional Argument.
Specifies column name that contains the labels indicating
which data needs to be stratified.
Notes:
1. Must be used with "frac" argument for stratifying data.
2. seed is supported for stratify column.
3. Arguments "stratify_column", "seed", "id_column" are supported only
for stratifying the data.
Types: str
seed:
Optional Argument.
Specifies the seed value that controls the data sample. The sample remains
the same as long as the seed remains the same. Use this argument to get
deterministic samples. "seed" must be greater than or equal to 0 and
less than or equal to 2147483647.
Notes:
1. Random seed is generated internally when argument
is not specified.
2. Seed is supported only when "stratify_column" is used.
Ignored otherwise.
3. Arguments "stratify_column", "seed", "id_column" are supported only
for stratifying the data.
Types: int
id_column:
Required when "seed" is used. Optional otherwise.
Specifies the input data column name that has the
unique identifier for each row in the input.
Notes:
1. Arguments "stratify_column", "seed", "id_column" are supported only
for stratifying the data.
2. "id_column" is supported only when "stratify_column" is used.
Ignored otherwise.
Types: str
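To make the roles of "stratify_column", "seed" and "id_column" concrete, here is a conceptual stratified sampler in plain Python. It is not the algorithm teradataml or Vantage uses; `stratified_sample` is a hypothetical function, and sorting by the unique id before shuffling is just one plausible way a unique identifier can make a seeded sample reproducible regardless of input order.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, id_key, frac, seed):
    """Pick the same fraction of rows from every label group, reproducibly."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    rng = random.Random(seed)
    picked = []
    for label in sorted(groups):
        # Sorting by the unique id gives a stable order before the seeded shuffle.
        members = sorted(groups[label], key=lambda r: r[id_key])
        rng.shuffle(members)
        picked.extend(members[: round(frac * len(members))])
    return picked


# Made-up data: 6 'Advanced' rows and 4 'Novice' rows.
rows = [{"id": i, "stats": "Advanced" if i <= 6 else "Novice"} for i in range(1, 11)]
first = stratified_sample(rows, "stats", "id", frac=0.5, seed=42)
second = stratified_sample(list(reversed(rows)), "stats", "id", frac=0.5, seed=42)
```

Both calls return the same rows because the seed and the id ordering fix the shuffle, and each label contributes half of its rows, so the proportions of the labels are preserved in the sample.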
RETURNS:
teradataml DataFrame
RAISES:
1. ValueError - When columns of different dataframes are given in ColumnExpression.
or
When columns are given in string format and not ColumnExpression.
2. TeradataMlException - If types of input parameters are mismatched.
3. TypeError
Examples:
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
# Print dataframe.
>>> df
masters gpa stats programming admitted
id
13 no 4.00 Advanced Novice 1
26 yes 3.57 Advanced Advanced 1
5 no 3.44 Novice Novice 0
19 yes 1.98 Advanced Advanced 0
15 yes 4.00 Advanced Advanced 1
40 yes 3.95 Novice Beginner 0
7 yes 2.33 Novice Novice 1
22 yes 3.46 Novice Beginner 0
36 no 3.00 Advanced Novice 0
38 yes 2.65 Advanced Beginner 1
# Sample with only n argument.
# Randomly samples 2 rows from the teradataml DataFrame.
# As there is only 1 sample 'sampleid' is 1.
>>> df.sample(n = 2)
masters gpa stats programming admitted SampleId
id
18 yes 3.81 Advanced Advanced 1 1
19 yes 1.98 Advanced Advanced 0 1
# Sample with multiple sample values for n.
# Creates 2 samples with 2 and 1 rows respectively.
# 'sampleid' takes 2 values (1, 2), one for each sample.
>>> df.sample(n = [2, 1])
masters gpa stats programming admitted SampleId
id
1 yes 3.95 Beginner Beginner 0 1
10 no 3.71 Advanced Advanced 1 1
11 no 3.13 Advanced Advanced 1 2
# Sample with only frac parameter.
# Randomly samples 20% of total rows present in teradataml DataFrame.
>>> df.sample(frac = 0.2)
masters gpa stats programming admitted SampleId
id
18 yes 3.81 Advanced Advanced 1 1
15 yes 4.00 Advanced Advanced 1 1
14 yes 3.45 Advanced Advanced 0 1
35 no 3.68 Novice Beginner 1 1
27 yes 3.96 Advanced Advanced 0 1
25 no 3.96 Advanced Advanced 1 1
10 no 3.71 Advanced Advanced 1 1
9 no 3.82 Advanced Advanced 1 1
# Sample with multiple sample values for frac.
# Creates 2 samples with 4% and 2% of total rows in teradataml DataFrame respectively.
>>> df.sample(frac = [0.04, 0.02])
masters gpa stats programming admitted SampleId
id
29 yes 4.00 Novice Beginner 0 1
19 yes 1.98 Advanced Advanced 0 2
11 no 3.13 Advanced Advanced 1 1
# Sample with n and replace and randomization.
# Creates 2 samples with 2 and 1 rows respectively, with possible redundant
# sampling as replace is True, and also selects rows from different AMPs as
# randomize is True.
>>> df.sample(n = [2, 1], replace = True, randomize = True)
masters gpa stats programming admitted SampleId
id
12 no 3.65 Novice Novice 1 1
39 yes 3.75 Advanced Beginner 0 1
20 yes 3.90 Advanced Advanced 1 2
# Sample with frac and replace and randomization.
# Creates 2 samples with 4% and 2% of total rows in teradataml DataFrame
# respectively, with possible redundant sampling, and also selects rows from different AMPs.
>>> df.sample(frac = [0.04, 0.02], replace = True, randomize = True)
masters gpa stats programming admitted SampleId
id
7 yes 2.33 Novice Novice 1 2
30 yes 3.79 Advanced Novice 0 1
33 no 3.55 Novice Novice 1 1
# Sample with case_when_then.
# Creates 2 samples with 1, 2 rows respectively from rows which satisfy df.gpa < 2
# and 2.5% of rows from rows which satisfy df.stats == 'Advanced'.
>>> df.sample(case_when_then={df.gpa < 2 : [1, 2], df.stats == 'Advanced' : 0.025})
masters gpa stats programming admitted SampleId
id
19 yes 1.98 Advanced Advanced 0 1
24 no 1.87 Advanced Novice 1 1
11 no 3.13 Advanced Advanced 1 3
# Sample with case_when_then and replace, randomize.
# Creates 2 samples with 1, 2 rows respectively from rows which satisfy df.gpa < 2
# and 2.5% of rows from rows which satisfy df.stats == 'Advanced' and selects rows
# from different AMPs with replacement.
>>> df.sample(replace = True, randomize = True, case_when_then={df.gpa < 2 : [1, 2],
df.stats == 'Advanced' : 0.025})
masters gpa stats programming admitted SampleId
id
24 no 1.87 Advanced Novice 1 1
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 1
31 yes 3.50 Advanced Beginner 1 3
# Sample with case_when_then and case_else.
# Creates 7 samples:
# 2 samples with 1, 3 rows from rows which satisfy df.gpa > 2.
# 2 samples with 1, 2 rows from rows which satisfy df.stats == 'Novice'.
# 1 sample with 5 rows from rows which satisfy df.programming == 'Novice'.
# 1 sample with 5 rows from rows which satisfy df.masters == 'no'.
# 1 sample with 1 row from rows which meet none of the above conditions.
>>> df.sample(case_when_then = {df.gpa > 2 : [1, 3], df.stats == 'Novice' : [1, 2],
df.programming == 'Novice' : 5, df.masters == 'no': 5}, case_else = 1)
masters gpa stats programming admitted SampleId
id
24 no 1.87 Advanced Novice 1 5
2 yes 3.76 Beginner Beginner 0 1
12 no 3.65 Novice Novice 1 2
38 yes 2.65 Advanced Beginner 1 2
36 no 3.00 Advanced Novice 0 2
19 yes 1.98 Advanced Advanced 0 7
# Sample with case_when_then and case_else
# Creates 4 samples: 2 samples with 1, 3 rows from rows which satisfy df.gpa < 2, and
# 2 samples with 2.5%, 5% of rows from all the rows which do not
# meet condition df.gpa < 2.
>>> df.sample(case_when_then = {df.gpa < 2 : [1, 3]}, case_else = [0.025, 0.05])
masters gpa stats programming admitted SampleId
id
9 no 3.82 Advanced Advanced 1 4
24 no 1.87 Advanced Novice 1 1
26 yes 3.57 Advanced Advanced 1 4
13 no 4.00 Advanced Novice 1 3
19 yes 1.98 Advanced Advanced 0 1
# Sample with case_when_then, case_else, replace, randomize
# Creates 4 samples: 2 samples with 1, 3 rows from rows which satisfy df.gpa < 2, and
# 2 samples with 2.5%, 5% of rows from all the rows which do not
# meet condition df.gpa < 2, with possible redundant sampling,
# and also selects rows from different AMPs.
>>> df.sample(case_when_then = {df.gpa < 2 : [1, 3]}, replace = True,
randomize = True, case_else = [0.025, 0.05])
masters gpa stats programming admitted SampleId
id
19 yes 1.98 Advanced Advanced 0 1
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
40 yes 3.95 Novice Beginner 0 3
3 no 3.70 Novice Beginner 1 4
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 1