sample() Method - Teradata Package for Python

Teradata® Package for Python User Guide

Product
Teradata Package for Python
Release Number
17.00
Published
November 2021
Language
English (United States)
Last Update
2022-01-14
Product Category
Teradata Vantage

The sample() function samples rows from a teradataml DataFrame, either directly or based on conditions. It adds a column, 'sampleid', to the result; each sample gets its own ID in this column, so every row can be traced back to the sample it belongs to.

If more than one sample() operation is performed on a teradataml DataFrame, only the 'sampleid' from the latest call is projected. 'sampleid' columns from previous calls are ignored.
In this example, the 'sampleid' column shown is from the latest sample() operation, sample(frac = [0.5, 0.1]).
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")

>>> df.sample(n = 20).sample(frac = [0.5, 0.1])
   masters   gpa     stats programming  admitted  sampleid
id
15     yes  4.00  Advanced    Advanced         1         1
37      no  3.52    Novice      Novice         1         1
35      no  3.68    Novice    Beginner         1         2
17      no  3.83  Advanced    Advanced         1         1
9       no  3.82  Advanced    Advanced         1         1
3       no  3.70    Novice    Beginner         1         1
34     yes  3.85  Advanced    Beginner         0         1
39     yes  3.75  Advanced    Beginner         0         1
36      no  3.00  Advanced      Novice         0         2
19     yes  1.98  Advanced    Advanced         0         1
In the following example, two sample() operations are performed on a DataFrame, but not consecutively. The 'sampleid' column in the result is from the latest sample() operation, sample(n = [5, 4]). The 'sampleid' column from the previous call is ignored.
>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame.from_table('admissions_train')

>>> df.sample(frac = 0.8).filter(items = ["masters"]).sample(n = [5, 4])
  masters  sampleid
0     yes         1
1      no         1
2     yes         1
3      no         2
4      no         1
5      no         2
6     yes         2
7     yes         1
8     yes         2

Example Prerequisite

>>> from teradataml import *
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
>>> df
      masters   gpa     stats programming admitted
   id
   13      no  4.00  Advanced      Novice        1
   26     yes  3.57  Advanced    Advanced        1
   5       no  3.44    Novice      Novice        0
   19     yes  1.98  Advanced    Advanced        0
   15     yes  4.00  Advanced    Advanced        1
   40     yes  3.95    Novice    Beginner        0
   7      yes  2.33    Novice      Novice        1
   22     yes  3.46    Novice    Beginner        0
   36      no  3.00  Advanced      Novice        0
   38     yes  2.65  Advanced    Beginner        1

Example 1: Sample one specific number of rows

This example randomly samples 2 rows from the teradataml DataFrame. As there is only one sample, the 'sampleid' is 1.

>>> df.sample(n = 2)
      masters   gpa     stats programming admitted SampleId
   id
   18     yes  3.81  Advanced    Advanced        1        1
   19     yes  1.98  Advanced    Advanced        0        1

Example 2: Sample multiple values for the number of rows to be sampled

This example creates two samples, one with 2 rows and one with 1 row. 'sampleid' takes two values (1, 2), each identifying one sample.

>>> df.sample(n = [2, 1])
      masters   gpa     stats programming admitted SampleId
   id
   1      yes  3.95  Beginner    Beginner        0        1
   10      no  3.71  Advanced    Advanced        1        1
   11      no  3.13  Advanced    Advanced        1        2
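
The way 'sampleid' labels each requested sample can be pictured with a small stand-alone model. This is only a sketch in plain Python, not how teradataml or Vantage implements sampling: the function name sample_by_counts, the seed parameter, and the list of 40 row ids are all assumptions made for illustration. It draws the requested number of rows without repetition and tags the first group with 1, the second with 2, and so on.

```python
import random


def sample_by_counts(row_ids, n, seed=None):
    """Toy model: draw disjoint samples of the requested sizes.

    Returns (row_id, sampleid) pairs; sampleid is 1 for the first
    requested size, 2 for the second, and so on.
    """
    sizes = [n] if isinstance(n, int) else list(n)
    rng = random.Random(seed)
    # One draw without replacement, so no row lands in two samples.
    picked = rng.sample(row_ids, sum(sizes))
    tagged, start = [], 0
    for sampleid, size in enumerate(sizes, start=1):
        tagged.extend((row_id, sampleid) for row_id in picked[start:start + size])
        start += size
    return tagged


# Hypothetical stand-in for the 'id' column: 40 row identifiers.
rows = list(range(1, 41))
result = sample_by_counts(rows, [2, 1], seed=0)
for row_id, sampleid in result:
    print(row_id, sampleid)
```

With n = [2, 1] the model yields three rows, two tagged 1 and one tagged 2, which mirrors the shape of the output shown for this example.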

Example 3: Sample one specific percentage of rows

This example randomly samples 20% of the total rows in the input teradataml DataFrame.

>>> df.sample(frac = 0.2)
      masters   gpa     stats programming admitted SampleId
   id
   18     yes  3.81  Advanced    Advanced        1        1
   15     yes  4.00  Advanced    Advanced        1        1
   14     yes  3.45  Advanced    Advanced        0        1
   35      no  3.68    Novice    Beginner        1        1
   27     yes  3.96  Advanced    Advanced        0        1
   25      no  3.96  Advanced    Advanced        1        1
   10      no  3.71  Advanced    Advanced        1        1
   9       no  3.82  Advanced    Advanced        1        1

Example 4: Sample multiple values for the percentage of rows to be sampled

This example creates two samples, one with 4% of total rows and one with 2% of total rows.

>>> df.sample(frac = [0.04, 0.02])
      masters   gpa     stats programming admitted SampleId
   id
   29     yes  4.00    Novice    Beginner        0        1
   19     yes  1.98  Advanced    Advanced        0        2
   11      no  3.13  Advanced    Advanced        1        1
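
The row counts in Examples 3 and 4 can be checked with simple arithmetic. The sketch below assumes the table holds 40 rows, which is consistent with the 8 rows returned for frac = 0.2, and assumes that fractional counts are rounded to the nearest whole row. Both are assumptions inferred from the outputs shown here; the helper expected_rows is hypothetical, and the database decides the exact count.

```python
def expected_rows(total_rows, frac):
    """Approximate row count per sample for one or more fractions."""
    fracs = [frac] if isinstance(frac, (int, float)) else list(frac)
    # round() is an assumption made for this sketch; the database
    # determines the actual number of rows in each sample.
    return [round(total_rows * f) for f in fracs]


TOTAL_ROWS = 40  # assumed size of admissions_train

print(expected_rows(TOTAL_ROWS, 0.2))           # [8]
print(expected_rows(TOTAL_ROWS, [0.04, 0.02]))  # [2, 1]
```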

Example 5: Sample specific number of rows, replace and randomization

This example creates two samples, one with 2 rows and one with 1 row. Because replace is True, the same row may be sampled more than once; because randomize is True, rows are selected from different AMPs.

>>> df.sample(n = [2, 1], replace = True, randomize = True)
      masters   gpa     stats programming admitted SampleId
   id
   12      no  3.65    Novice      Novice        1        1
   39     yes  3.75  Advanced    Beginner        0        1
   20     yes  3.90  Advanced    Advanced        1        2
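
The effect of replace can be illustrated outside the database. The following is a plain-Python sketch, not teradataml code: the function draw and its seed parameter are hypothetical, and the model ignores randomize and AMP allocation entirely. It only shows that sampling with replacement can return the same row several times, while sampling without replacement cannot.

```python
import random


def draw(row_ids, size, replace=False, seed=None):
    """Toy model of drawing `size` rows with or without replacement."""
    rng = random.Random(seed)
    if replace:
        # Each pick is independent, so a row can appear repeatedly.
        return rng.choices(row_ids, k=size)
    # Each row can be picked at most once.
    return rng.sample(row_ids, size)


few_rows = [19, 24]            # hypothetical ids of two rows
many_rows = list(range(1, 41))  # hypothetical ids of 40 rows

with_replacement = draw(few_rows, 5, replace=True, seed=1)
without_replacement = draw(many_rows, 3, seed=1)

print(with_replacement)     # five picks from two rows, so repeats are unavoidable
print(without_replacement)  # three distinct ids
```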

Example 6: Sample specific percentage of rows, replace and randomization

This example creates two samples, one with 4% of total rows and one with 2% of total rows in the teradataml DataFrame. Because replace is True, the same row may be sampled more than once; because randomize is True, rows are selected from different AMPs.

>>> df.sample(frac = [0.04, 0.02], replace = True, randomize = True)
      masters   gpa     stats programming admitted SampleId
   id
   7      yes  2.33    Novice      Novice        1        2
   30     yes  3.79  Advanced      Novice        0        1
   33      no  3.55    Novice      Novice        1        1

Example 7: Sample with condition and number of samples to be sampled

This example creates three samples: two with 1 and 2 rows respectively from rows that satisfy df.gpa < 2, and one with 2.5% of the rows that satisfy df.stats == 'Advanced'.

>>> df.sample(case_when_then={df.gpa < 2 : [1, 2], df.stats == 'Advanced' : 0.025})
      masters   gpa     stats programming admitted SampleId
   id
   19     yes  1.98  Advanced    Advanced        0        1
   24      no  1.87  Advanced      Novice        1        1
   11      no  3.13  Advanced    Advanced        1        3

Example 8: Sample with condition and number of samples to be sampled, replace and randomization

This example creates three samples: two with 1 and 2 rows respectively from rows that satisfy df.gpa < 2, and one with 2.5% of the rows that satisfy df.stats == 'Advanced'. Rows are selected from different AMPs, with replacement.

>>> df.sample(replace = True, randomize = True, case_when_then={df.gpa < 2 : [1, 2], df.stats == 'Advanced' : 0.025})
      masters   gpa     stats programming admitted SampleId
   id
   24      no  1.87  Advanced      Novice        1        1
   24      no  1.87  Advanced      Novice        1        2
   24      no  1.87  Advanced      Novice        1        2
   24      no  1.87  Advanced      Novice        1        2
   24      no  1.87  Advanced      Novice        1        2
   24      no  1.87  Advanced      Novice        1        1
   31     yes  3.50  Advanced    Beginner        1        3

Example 9: Sample with different conditions and numbers of samples to be sampled

This example creates seven samples:
  • Two with 1 and 3 rows from rows that satisfy df.gpa > 2
  • Two with 1 and 2 rows from rows that satisfy df.stats == 'Novice'
  • One with 5 rows from rows that satisfy df.programming == 'Novice'
  • One with 5 rows from rows that satisfy df.masters == 'no'
  • One with 1 row from rows that meet none of the above conditions
>>> df.sample(case_when_then = {df.gpa > 2 : [1, 3], df.stats == 'Novice' : [1, 2], df.programming == 'Novice' : 5, df.masters == 'no': 5}, case_else = 1)
      masters   gpa     stats programming admitted SampleId
   id
   24      no  1.87  Advanced      Novice        1        5
   2      yes  3.76  Beginner    Beginner        0        1
   12      no  3.65    Novice      Novice        1        2
   38     yes  2.65  Advanced    Beginner        1        2
   36      no  3.00  Advanced      Novice        0        2
   19     yes  1.98  Advanced    Advanced        0        7
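
The output above suggests that sample IDs are numbered consecutively across the conditions, in the order they are written: row 24 matches only df.programming == 'Novice' among the first three conditions and carries SampleId 5, and row 19 falls through to case_else and carries SampleId 7. The sketch below reproduces that numbering in plain Python. It is an inference from this output, not a statement of how teradataml assigns IDs; the function number_samples is hypothetical, and conditions are written as plain strings instead of teradataml column expressions.

```python
def number_samples(case_when_then, case_else=None):
    """Assign consecutive sample ids to every requested sample size.

    `case_when_then` maps a condition label (a plain string here) to a
    size or a list of sizes; `case_else` gives the size(s) for rows
    that match no condition.
    """
    def as_list(sizes):
        return list(sizes) if isinstance(sizes, (list, tuple)) else [sizes]

    clauses = list(case_when_then.items())
    if case_else is not None:
        clauses.append(("else", case_else))

    plan, next_id = [], 1
    for label, sizes in clauses:
        for size in as_list(sizes):
            plan.append((next_id, label, size))
            next_id += 1
    return plan


plan = number_samples(
    {
        "gpa > 2": [1, 3],
        "stats == 'Novice'": [1, 2],
        "programming == 'Novice'": 5,
        "masters == 'no'": 5,
    },
    case_else=1,
)
for sampleid, label, size in plan:
    print(sampleid, label, size)
```

Under this model the four conditions and the else branch produce IDs 1 through 7, matching the seven samples described for this example.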

Example 10: Sample with different conditions and numbers of samples to be sampled, replace and randomization

This example creates four samples:
  • Two with 1 and 3 rows from rows that satisfy df.gpa < 2
  • Two with 2.5% and 5% of the rows that do not meet the above condition, with possible redundant sampling and with rows selected from different AMPs
>>> df.sample(case_when_then = {df.gpa < 2 : [1, 3]}, replace = True, randomize = True, case_else = [0.025, 0.05])
      masters   gpa     stats programming admitted SampleId
   id
   19     yes  1.98  Advanced    Advanced        0        1
   19     yes  1.98  Advanced    Advanced        0        2
   19     yes  1.98  Advanced    Advanced        0        2
   19     yes  1.98  Advanced    Advanced        0        2
   19     yes  1.98  Advanced    Advanced        0        2
   40     yes  3.95    Novice    Beginner        0        3
   3       no  3.70    Novice    Beginner        1        4
   19     yes  1.98  Advanced    Advanced        0        2
   19     yes  1.98  Advanced    Advanced        0        2
   19     yes  1.98  Advanced    Advanced        0        1