The sample() function samples rows from a teradataml DataFrame, either directly or based on conditions. It adds a 'sampleid' column that holds a unique ID for each sample, so you can tell which sample each row belongs to.
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
>>> df.sample(n = 20).sample(frac = [0.5, 0.1])
masters gpa stats programming admitted sampleid
id
15 yes 4.00 Advanced Advanced 1 1
37 no 3.52 Novice Novice 1 1
35 no 3.68 Novice Beginner 1 2
17 no 3.83 Advanced Advanced 1 1
9 no 3.82 Advanced Advanced 1 1
3 no 3.70 Novice Beginner 1 1
34 yes 3.85 Advanced Beginner 0 1
39 yes 3.75 Advanced Beginner 0 1
36 no 3.00 Advanced Novice 0 2
19 yes 1.98 Advanced Advanced 0 1
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame.from_table('admissions_train')
>>> df.sample(frac = 0.8).filter(items = ["masters"]).sample(n = [5, 4])
masters sampleid
0 yes 1
1 no 1
2 yes 1
3 no 2
4 no 1
5 no 2
6 yes 2
7 yes 1
8 yes 2
Example Prerequisite
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
>>> df
masters gpa stats programming admitted
id
13 no 4.00 Advanced Novice 1
26 yes 3.57 Advanced Advanced 1
5 no 3.44 Novice Novice 0
19 yes 1.98 Advanced Advanced 0
15 yes 4.00 Advanced Advanced 1
40 yes 3.95 Novice Beginner 0
7 yes 2.33 Novice Novice 1
22 yes 3.46 Novice Beginner 0
36 no 3.00 Advanced Novice 0
38 yes 2.65 Advanced Beginner 1
Example 1: Sample one specific number of rows
This example randomly samples 2 rows from the teradataml DataFrame. Because there is only one sample, 'sampleid' is 1 for every row.
>>> df.sample(n = 2)
masters gpa stats programming admitted SampleId
id
18 yes 3.81 Advanced Advanced 1 1
19 yes 1.98 Advanced Advanced 0 1
Example 2: Sample multiple values for the number of rows to be sampled
This example creates two samples, one with 2 rows and one with 1 row. The 'sampleid' column takes two values (1 and 2), each identifying one sample.
>>> df.sample(n = [2, 1])
masters gpa stats programming admitted SampleId
id
1 yes 3.95 Beginner Beginner 0 1
10 no 3.71 Advanced Advanced 1 1
11 no 3.13 Advanced Advanced 1 2
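The way a list of sizes turns into 'sampleid' values can be pictured in plain Python. This is only an illustrative sketch, not teradataml's implementation: the function name sample_with_ids is hypothetical, the 40 ids are a stand-in for the table, and it assumes that samples drawn without replacement do not share rows.

```python
import random

def sample_with_ids(rows, sizes, seed=None):
    """Draw len(sizes) disjoint samples and tag each row with a 1-based sample id."""
    rng = random.Random(seed)
    # Draw every needed row in one pass so no row lands in two samples
    # (assumption: sampling without replacement keeps samples disjoint).
    picked = rng.sample(rows, sum(sizes))
    tagged = []
    start = 0
    for sample_id, size in enumerate(sizes, start=1):
        for row in picked[start:start + size]:
            tagged.append((row, sample_id))
        start += size
    return tagged

ids = list(range(1, 41))  # stand-in for the 'id' values of a 40-row table
result = sample_with_ids(ids, [2, 1], seed=0)
for row_id, sample_id in result:
    print(row_id, sample_id)
```

Each printed pair is a row id followed by its sample id: two rows carry 1 and one row carries 2, mirroring the shape of the output above.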
Example 3: Sample one specific percentage of rows
This example randomly samples 20% of the total rows in the input teradataml DataFrame.
>>> df.sample(frac = 0.2)
masters gpa stats programming admitted SampleId
id
18 yes 3.81 Advanced Advanced 1 1
15 yes 4.00 Advanced Advanced 1 1
14 yes 3.45 Advanced Advanced 0 1
35 no 3.68 Novice Beginner 1 1
27 yes 3.96 Advanced Advanced 0 1
25 no 3.96 Advanced Advanced 1 1
10 no 3.71 Advanced Advanced 1 1
9 no 3.82 Advanced Advanced 1 1
Example 4: Sample multiple values for the percentage of rows to be sampled
This example creates two samples, one with 4% of total rows and one with 2% of total rows.
>>> df.sample(frac = [0.04, 0.02])
masters gpa stats programming admitted SampleId
id
29 yes 4.00 Novice Beginner 0 1
19 yes 1.98 Advanced Advanced 0 2
11 no 3.13 Advanced Advanced 1 1
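The row counts in Examples 3 and 4 are consistent with simple arithmetic on a 40-row table. The sketch below is an inference from the outputs shown, not documented behavior: it assumes the table holds 40 rows and that the requested fraction is rounded to the nearest whole row. The helper name rows_for_fraction is hypothetical.

```python
def rows_for_fraction(total_rows, frac):
    # Assumption: the sample size is the fraction of the total row count,
    # rounded to the nearest integer.
    return round(total_rows * frac)

total = 40  # assumed size of the example table
sizes = {frac: rows_for_fraction(total, frac) for frac in (0.2, 0.04, 0.02)}
print(sizes)  # {0.2: 8, 0.04: 2, 0.02: 1}
```

Under these assumptions, frac = 0.2 gives the 8 rows of Example 3, and frac = [0.04, 0.02] gives the 2 rows and 1 row of Example 4.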
Example 5: Sample specific number of rows, replace and randomization
This example creates two samples, one with 2 rows and one with 1 row. Because replace is True, the same row may be sampled more than once, and because randomize is True, rows are selected from different AMPs.
>>> df.sample(n = [2, 1], replace = True, randomize = True)
masters gpa stats programming admitted SampleId
id
12 no 3.65 Novice Novice 1 1
39 yes 3.75 Advanced Beginner 0 1
20 yes 3.90 Advanced Advanced 1 2
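What "possible redundant sampling" means can be shown with Python's standard random module. This is a plain-Python analogy only: the two-row pool and its ids are illustrative, and the sketch says nothing about how the database itself handles a request larger than the available rows.

```python
import random

rng = random.Random(0)
pool = [19, 24]  # illustrative two-row pool

# With replacement: a row goes back into the pool after being picked,
# so three draws from two rows must repeat at least one id.
with_replacement = rng.choices(pool, k=3)

# Without replacement: each row can be picked at most once, so Python's
# random.sample refuses a request larger than the pool.
try:
    rng.sample(pool, 3)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

print(with_replacement, oversample_allowed)
```

The same idea explains why a sample drawn with replace = True can contain repeated 'id' values.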
Example 6: Sample specific percentage of rows, replace and randomization
This example creates two samples, one with 4% and one with 2% of the total rows in the teradataml DataFrame. Because replace is True, the same row may be sampled more than once, and because randomize is True, rows are selected from different AMPs.
>>> df.sample(frac = [0.04, 0.02], replace = True, randomize = True)
masters gpa stats programming admitted SampleId
id
7 yes 2.33 Novice Novice 1 2
30 yes 3.79 Advanced Novice 0 1
33 no 3.55 Novice Novice 1 1
Example 7: Sample with condition and number of samples to be sampled
This example creates two samples, with 1 and 2 rows respectively, from rows that satisfy df.gpa < 2, and one more sample with 2.5% of the rows that satisfy df.stats == 'Advanced'.
>>> df.sample(case_when_then={df.gpa < 2 : [1, 2], df.stats == 'Advanced' : 0.025})
masters gpa stats programming admitted SampleId
id
19 yes 1.98 Advanced Advanced 0 1
24 no 1.87 Advanced Novice 1 1
11 no 3.13 Advanced Advanced 1 3
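Conditional sampling can be read as splitting the rows into groups before sampling each group. The sketch below assumes the conditions behave like a SQL CASE expression, where a row belongs to the first condition it satisfies. That is an assumption, not something this page states. The four rows are copied from the tables above, and the helper name stratum_of is hypothetical.

```python
rows = [
    {"id": 19, "gpa": 1.98, "stats": "Advanced"},
    {"id": 24, "gpa": 1.87, "stats": "Advanced"},
    {"id": 11, "gpa": 3.13, "stats": "Advanced"},
    {"id": 3, "gpa": 3.70, "stats": "Novice"},
]

# Conditions in the order they appear in case_when_then.
conditions = [
    ("gpa < 2", lambda r: r["gpa"] < 2),
    ("stats == 'Advanced'", lambda r: r["stats"] == "Advanced"),
]

def stratum_of(row):
    # Assumption: the first matching condition wins, as in SQL CASE.
    for label, predicate in conditions:
        if predicate(row):
            return label
    return None  # no condition matched; such rows would fall to case_else

strata = {row["id"]: stratum_of(row) for row in rows}
for row_id, label in strata.items():
    print(row_id, label)
```

Under this reading, rows 19 and 24 are candidates only for the gpa < 2 samples even though their stats value is also 'Advanced', while row 11 is a candidate for the stats == 'Advanced' sample.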
Example 8: Sample with condition and number of samples to be sampled, replace and randomization
This example creates two samples, with 1 and 2 rows respectively, from rows that satisfy df.gpa < 2, and one more sample with 2.5% of the rows that satisfy df.stats == 'Advanced'. Rows are sampled with replacement and selected from different AMPs.
>>> df.sample(replace = True, randomize = True, case_when_then={df.gpa < 2 : [1, 2], df.stats == 'Advanced' : 0.025})
masters gpa stats programming admitted SampleId
id
24 no 1.87 Advanced Novice 1 1
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 2
24 no 1.87 Advanced Novice 1 1
31 yes 3.50 Advanced Beginner 1 3
Example 9: Sample with different conditions and numbers of samples to be sampled
This example creates the following samples:
- Two with 1 and 3 rows from rows that satisfy df.gpa > 2
- Two with 1 and 2 rows from rows that satisfy df.stats == 'Novice'
- One with 5 rows from rows that satisfy df.programming == 'Novice'
- One with 5 rows from rows that satisfy df.masters == 'no'
- One with 1 row from rows that meet none of the previous conditions
>>> df.sample(case_when_then = {df.gpa > 2 : [1, 3], df.stats == 'Novice' : [1, 2], df.programming == 'Novice' : 5, df.masters == 'no': 5}, case_else = 1)
masters gpa stats programming admitted SampleId
id
24 no 1.87 Advanced Novice 1 5
2 yes 3.76 Beginner Beginner 0 1
12 no 3.65 Novice Novice 1 2
38 yes 2.65 Advanced Beginner 1 2
36 no 3.00 Advanced Novice 0 2
19 yes 1.98 Advanced Advanced 0 7
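The 'SampleId' values in this output are consistent with numbering the samples consecutively in the order the conditions are written, with case_else last. The sketch below reproduces that numbering in plain Python. It is inferred from the output above rather than from teradataml itself, and the helper name number_samples is hypothetical.

```python
def number_samples(clauses):
    """Map each condition label to the consecutive sample ids it would get."""
    sample_ids = {}
    next_id = 1
    for label, sizes in clauses:
        # A list requests one sample per entry; a single number requests one sample.
        count = len(sizes) if isinstance(sizes, list) else 1
        sample_ids[label] = list(range(next_id, next_id + count))
        next_id += count
    return sample_ids

clauses = [
    ("gpa > 2", [1, 3]),
    ("stats == 'Novice'", [1, 2]),
    ("programming == 'Novice'", 5),
    ("masters == 'no'", 5),
    ("case_else", 1),
]
sample_ids = number_samples(clauses)
for label, ids in sample_ids.items():
    print(label, ids)
```

With this numbering, row 24 (programming 'Novice', gpa below 2) falls in sample 5 and row 19, which matches no condition, falls in sample 7, as in the output above.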
Example 10: Sample with different conditions and numbers of samples to be sampled, replace and randomization
This example creates the following samples, with possible redundant sampling (replace is True) and rows selected from different AMPs (randomize is True):
- Two with 1 and 3 rows from rows that satisfy df.gpa < 2
- Two with 2.5% and 5% of the rows that do not meet the previous condition
>>> df.sample(case_when_then = {df.gpa < 2 : [1, 3]}, replace = True, randomize = True, case_else = [0.025, 0.05])
masters gpa stats programming admitted SampleId
id
19 yes 1.98 Advanced Advanced 0 1
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
40 yes 3.95 Novice Beginner 0 3
3 no 3.70 Novice Beginner 1 4
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 2
19 yes 1.98 Advanced Advanced 0 1