The sample() function samples rows from a teradataml DataFrame, either directly or based on conditions. The result contains an additional column, 'sampleid', which holds the ID of the sample each row belongs to, so the rows of each sample can be identified.
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
>>> df.sample(n = 20).sample(frac = [0.5, 0.1])
    masters   gpa     stats  programming  admitted  sampleid
id
15      yes  4.00  Advanced     Advanced         1         1
37       no  3.52    Novice       Novice         1         1
35       no  3.68    Novice     Beginner         1         2
17       no  3.83  Advanced     Advanced         1         1
 9       no  3.82  Advanced     Advanced         1         1
 3       no  3.70    Novice     Beginner         1         1
34      yes  3.85  Advanced     Beginner         0         1
39      yes  3.75  Advanced     Beginner         0         1
36       no  3.00  Advanced       Novice         0         2
19      yes  1.98  Advanced     Advanced         0         1

>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame.from_table('admissions_train')
>>> df.sample(frac = 0.8).filter(items = ["masters"]).sample(n = [5, 4])
   masters  sampleid
0      yes         1
1       no         1
2      yes         1
3       no         2
4       no         1
5       no         2
6      yes         2
7      yes         1
8      yes         2
Example Prerequisite
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
>>> df
    masters   gpa     stats  programming  admitted
id
13       no  4.00  Advanced       Novice         1
26      yes  3.57  Advanced     Advanced         1
 5       no  3.44    Novice       Novice         0
19      yes  1.98  Advanced     Advanced         0
15      yes  4.00  Advanced     Advanced         1
40      yes  3.95    Novice     Beginner         0
 7      yes  2.33    Novice       Novice         1
22      yes  3.46    Novice     Beginner         0
36       no  3.00  Advanced       Novice         0
38      yes  2.65  Advanced     Beginner         1
Example 1: Sample one specific number of rows
This example randomly samples 2 rows from the teradataml DataFrame. As there is only one sample, the 'sampleid' is 1.
>>> df.sample(n = 2)
    masters   gpa     stats  programming  admitted  SampleId
id
18      yes  3.81  Advanced     Advanced         1         1
19      yes  1.98  Advanced     Advanced         0         1
Example 2: Sample multiple values for the number of rows to be sampled
This example creates two samples, one with 2 rows and one with 1 row. The 'sampleid' column takes two values (1 and 2), each of which identifies one sample.
>>> df.sample(n = [2, 1])
    masters   gpa     stats  programming  admitted  SampleId
id
 1      yes  3.95  Beginner     Beginner         0         1
10       no  3.71  Advanced     Advanced         1         1
11       no  3.13  Advanced     Advanced         1         2
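To make the 'sampleid' assignment concrete, here is a minimal pure-Python sketch. It is not teradataml code and does not describe how the database draws rows; it assumes the samples are drawn without replacement and do not overlap. The function name `sample_without_replacement` and the list `ids` (a stand-in for 40 'id' values) are hypothetical and exist only for illustration.

```python
import random


def sample_without_replacement(rows, sizes, seed=None):
    """Draw one non-overlapping sample per entry in `sizes`.

    Returns (row, sample_id) pairs, with sample ids starting at 1.
    Illustrative sketch only; not the teradataml implementation.
    """
    rng = random.Random(seed)
    # Draw every requested row in one pass so no row lands in two samples.
    picked = rng.sample(rows, sum(sizes))
    tagged = []
    start = 0
    for sample_id, size in enumerate(sizes, start=1):
        for row in picked[start:start + size]:
            tagged.append((row, sample_id))
        start += size
    return tagged


ids = list(range(1, 41))  # assumed stand-in for the 'id' values of the table
for row, sample_id in sample_without_replacement(ids, [2, 1], seed=0):
    print(row, sample_id)
```

As in the output of Example 2, the first two rows carry sample id 1 and the third carries sample id 2.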
Example 3: Sample one specific percentage of rows
This example randomly samples 20% of the total rows in the input teradataml DataFrame.
>>> df.sample(frac = 0.2)
    masters   gpa     stats  programming  admitted  SampleId
id
18      yes  3.81  Advanced     Advanced         1         1
15      yes  4.00  Advanced     Advanced         1         1
14      yes  3.45  Advanced     Advanced         0         1
35       no  3.68    Novice     Beginner         1         1
27      yes  3.96  Advanced     Advanced         0         1
25       no  3.96  Advanced     Advanced         1         1
10       no  3.71  Advanced     Advanced         1         1
 9       no  3.82  Advanced     Advanced         1         1
Example 4: Sample multiple values for the percentage of rows to be sampled
This example creates two samples, one with 4% of total rows and one with 2% of total rows.
>>> df.sample(frac = [0.04, 0.02])
    masters   gpa     stats  programming  admitted  SampleId
id
29      yes  4.00    Novice     Beginner         0         1
19      yes  1.98  Advanced     Advanced         0         2
11       no  3.13  Advanced     Advanced         1         1
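The row counts in the output above can be checked with a little arithmetic. The sketch below assumes the table holds 40 rows (consistent with the 8 rows returned for frac = 0.2 in Example 3) and assumes fractional row counts are rounded to the nearest whole row; the rounding rule is a guess that happens to match the output shown, not documented behavior. `expected_sample_sizes` is a hypothetical helper.

```python
def expected_sample_sizes(total_rows, fractions):
    """Estimate how many rows each fraction selects.

    Rounding to the nearest whole row is an assumption made for this
    sketch; the database may apply a different rule.
    """
    return [round(total_rows * fraction) for fraction in fractions]


# 4% of 40 rows is 1.6 and 2% of 40 rows is 0.8.
print(expected_sample_sizes(40, [0.04, 0.02]))  # prints [2, 1]
```

This agrees with the two rows tagged with SampleId 1 and the single row tagged with SampleId 2.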
Example 5: Sample specific number of rows, replace and randomization
This example creates two samples, one with 2 rows and one with 1 row. Because replace is True, the same row may be sampled more than once; because randomize is True, rows are selected from different AMPs.
>>> df.sample(n = [2, 1], replace = True, randomize = True)
    masters   gpa     stats  programming  admitted  SampleId
id
12       no  3.65    Novice       Novice         1         1
39      yes  3.75  Advanced     Beginner         0         1
20      yes  3.90  Advanced     Advanced         1         2
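The effect of replace = True can be illustrated with a small pure-Python sketch. It is not teradataml code: it models only sampling with replacement and ignores randomize and AMP allocation entirely. `sample_with_replacement` is a hypothetical helper.

```python
import random


def sample_with_replacement(rows, sizes, seed=None):
    """Draw one sample per entry in `sizes`, allowing repeated rows.

    Illustrative sketch only; AMP-level allocation is not modeled.
    """
    rng = random.Random(seed)
    tagged = []
    for sample_id, size in enumerate(sizes, start=1):
        # random.choices draws independently, so a row can repeat
        # within a sample and across samples.
        for row in rng.choices(rows, k=size):
            tagged.append((row, sample_id))
    return tagged


# With a single candidate row, every draw returns that row, so the
# samples can hold more rows than the population contains.
print(sample_with_replacement([24], [2, 1]))  # prints [(24, 1), (24, 1), (24, 2)]
```

The same pattern appears in the outputs of Examples 8 and 10, where one row is returned several times.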
Example 6: Sample specific percentage of rows, replace and randomization
This example creates two samples, one with 4% and one with 2% of the total rows in the teradataml DataFrame. With replace = True the same row may be sampled more than once, and with randomize = True rows are selected from different AMPs.
>>> df.sample(frac = [0.04, 0.02], replace = True, randomize = True)
    masters   gpa     stats  programming  admitted  SampleId
id
 7      yes  2.33    Novice       Novice         1         2
30      yes  3.79  Advanced       Novice         0         1
33       no  3.55    Novice       Novice         1         1
Example 7: Sample with condition and number of samples to be sampled
This example creates three samples: two with 1 and 2 rows, respectively, from rows which satisfy df.gpa < 2, and one with 2.5% of the rows which satisfy df.stats == 'Advanced'.
>>> df.sample(case_when_then={df.gpa < 2 : [1, 2], df.stats == 'Advanced' : 0.025})
    masters   gpa     stats  programming  admitted  SampleId
id
19      yes  1.98  Advanced     Advanced         0         1
24       no  1.87  Advanced       Novice         1         1
11       no  3.13  Advanced     Advanced         1         3
Example 8: Sample with condition and number of samples to be sampled, replace and randomization
This example creates three samples: two with 1 and 2 rows, respectively, from rows which satisfy df.gpa < 2, and one with 2.5% of the rows which satisfy df.stats == 'Advanced'. Rows are sampled with replacement and selected from different AMPs.
>>> df.sample(replace = True, randomize = True, case_when_then={df.gpa < 2 : [1, 2], df.stats == 'Advanced' : 0.025})
    masters   gpa     stats  programming  admitted  SampleId
id
24       no  1.87  Advanced       Novice         1         1
24       no  1.87  Advanced       Novice         1         2
24       no  1.87  Advanced       Novice         1         2
24       no  1.87  Advanced       Novice         1         2
24       no  1.87  Advanced       Novice         1         2
24       no  1.87  Advanced       Novice         1         1
31      yes  3.50  Advanced     Beginner         1         3
Example 9: Sample with different conditions and numbers of samples to be sampled
This example creates seven samples:
- Two with 1 and 3 rows from rows which satisfy df.gpa > 2
- Two with 1 and 2 rows from rows which satisfy df.stats == 'Novice'
- One with 5 rows from rows which satisfy df.programming == 'Novice'
- One with 5 rows from rows which satisfy df.masters == 'no'
- One with 1 row from rows which meet none of the previous conditions
>>> df.sample(case_when_then = {df.gpa > 2 : [1, 3], df.stats == 'Novice' : [1, 2], df.programming == 'Novice' : 5, df.masters == 'no': 5}, case_else = 1)
    masters   gpa     stats  programming  admitted  SampleId
id
24       no  1.87  Advanced       Novice         1         5
 2      yes  3.76  Beginner     Beginner         0         1
12       no  3.65    Novice       Novice         1         2
38      yes  2.65  Advanced     Beginner         1         2
36       no  3.00  Advanced       Novice         0         2
19      yes  1.98  Advanced     Advanced         0         7
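The SampleId values in the output of Example 9 suggest that sample ids are numbered consecutively across the conditions, in the order the conditions are written, with case_else last. The sketch below reproduces that numbering in pure Python. The numbering rule is inferred from the outputs shown on this page (for instance, row 24 lands in sample 5 and row 19 in sample 7), not taken from the library's implementation, and `sample_id_ranges` is a hypothetical helper that uses plain strings as condition labels.

```python
def sample_id_ranges(case_when_then, case_else=None):
    """Map each condition label to the sample ids it would produce.

    Assumes ids are assigned consecutively in clause order, with the
    else clause last. Illustrative sketch only.
    """
    clauses = list(case_when_then.items())
    if case_else is not None:
        clauses.append(("else", case_else))

    ranges = {}
    next_id = 1
    for label, sizes in clauses:
        # A scalar requests one sample; a list requests one per entry.
        count = len(sizes) if isinstance(sizes, list) else 1
        ranges[label] = list(range(next_id, next_id + count))
        next_id += count
    return ranges


spec = {
    "gpa > 2": [1, 3],
    "stats == 'Novice'": [1, 2],
    "programming == 'Novice'": 5,
    "masters == 'no'": 5,
}
for label, sample_ids in sample_id_ranges(spec, case_else=1).items():
    print(label, sample_ids)
```

Under this assumption the four conditions map to sample ids [1, 2], [3, 4], [5] and [6], and the else clause maps to [7].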
Example 10: Sample with different conditions and numbers of samples to be sampled, replace and randomization
This example creates four samples, sampling with replacement and selecting rows from different AMPs:
- Two with 1 and 3 rows from rows which satisfy df.gpa < 2
- Two with 2.5% and 5% of the rows which do not meet the previous condition
>>> df.sample(case_when_then = {df.gpa < 2 : [1, 3]}, replace = True, randomize = True, case_else = [0.025, 0.05])
    masters   gpa     stats  programming  admitted  SampleId
id
19      yes  1.98  Advanced     Advanced         0         1
19      yes  1.98  Advanced     Advanced         0         2
19      yes  1.98  Advanced     Advanced         0         2
19      yes  1.98  Advanced     Advanced         0         2
19      yes  1.98  Advanced     Advanced         0         2
40      yes  3.95    Novice     Beginner         0         3
 3       no  3.70    Novice     Beginner         1         4
19      yes  1.98  Advanced     Advanced         0         2
19      yes  1.98  Advanced     Advanced         0         2
19      yes  1.98  Advanced     Advanced         0         1