Teradata Package for Python Function Reference | 17.10 - sample - Teradata Package for Python - Look here for syntax, methods and examples for the functions included in the Teradata Package for Python.

Teradata® Package for Python Function Reference

Product: Teradata Package for Python
Release Number: 17.10
Published: April 2022
Language: English (United States)
Last Update: 2022-08-19
Product Category: Teradata Vantage

teradataml.geospatial.geodataframe.GeoDataFrame.sample = sample(self, n=None, frac=None, replace=False, randomize=False, case_when_then=None, case_else=None)

    DESCRIPTION:
        Samples rows from a GeoDataFrame, either directly or based on conditions.
        Creates a new column 'sampleid' that holds a unique id for each sample,
        which identifies the sample every returned row belongs to.

    PARAMETERS:
        n:
            Required Argument, if neither 'frac' nor 'case_when_then' is specified.
            Specifies one or more positive integer constants, each giving the
            number of rows to be sampled from the teradataml GeoDataFrame.
            Example:
                n = 10
                or n = [10]
                or n = [10, 20, 30, 40]
            Default Value: None
            Types: int or list of ints
            Note:
                1. You should use only one of the following arguments:
                   'n', 'frac' and 'case_when_then'.
                2. No more than 16 samples can be requested per count description.

        frac:
            Required Argument, if neither 'n' nor 'case_when_then' is specified.
            Specifies one or more unsigned floating point constants in the
            half-open interval (0,1], that is, greater than 0 and less than or
            equal to 1. Each value gives the fraction of rows to be sampled
            from the teradataml GeoDataFrame.
            Example:
                frac = 0.4
                or frac = [0.4]
                or frac = [0.2, 0.5]
            Default Value: None
            Types: float or list of floats
            Note:
                1. You should use only one of the following arguments:
                   'n', 'frac' and 'case_when_then'.
                2. No more than 16 samples can be requested per fraction description.
                3. The sum of the elements in the list must be greater than 0 and
                   must not exceed 1, because the total percentage cannot be
                   more than 100%.

        replace:
            Optional Argument.
            Specifies whether sampling should be done with replacement.
            Default Value: False
            Types: bool

        randomize:
            Optional Argument.
            Specifies whether sampling should be done across AMPs in Teradata
            (True) or per AMP (False).
            Default Value: False
            Types: bool

        case_when_then:
            Required Argument, if neither 'frac' nor 'n' is specified.
            Specifies conditions and the number of samples to be drawn for each
            condition, as key-value pairs.
            Keys should be of type ColumnExpression.
            Values should be of type int, float, list of ints or list of floats.
            The following usage of a key is not allowed:
                case_when_then = {"gpa" > 2 : 2}
            The following operators are supported:
                comparison: ==, !=, <, <=, >, >=
                boolean: & (and), | (or), ~ (not), ^ (xor)
            When combining conditions, wrap each comparison in parentheses,
            because &, | and ^ bind more tightly than the comparison operators
            in Python.
            Example:
                case_when_then = {df.gpa > 2 : 2}
                case_when_then = {(df.gpa > 2) & (df.stats == 'Novice') : [0.2, 0.3],
                                  df.programming == 'Advanced' : [10,20,30]}
            Default Value: None
            Types: dictionary
            Note:
                1. You should use only one of the following arguments:
                   'n', 'frac' and 'case_when_then'.
                2. No more than 16 samples can be requested per fraction
                   description or count description.
                3. If any value in the dictionary is a list of floats, the sum of
                   the elements in that list must be greater than 0 and must not
                   exceed 1, because the total percentage cannot be more than 100%.

        case_else:
            Optional Argument.
            Specifies the number of samples to be drawn from rows where none of
            the conditions in 'case_when_then' are met.
            Example:
                case_else = 10
                case_else = [10,20]
                case_else = [0.5]
                case_else = [0.2,0.4]
            Default Value: None
            Types: int or float or list of ints or list of floats
            Note:
                1. This argument can only be used with 'case_when_then'.
                   If used otherwise, the following error is raised:
                       'case_else' can only be used when 'case_when_then' is specified.
                2. No more than 16 samples can be requested per fraction
                   description or count description.
                3. If case_else is a list of floats, the sum of the elements in
                   the list must be greater than 0 and must not exceed 1,
                   because the total percentage cannot be more than 100%.

    RETURNS:
        teradataml GeoDataFrame

    RAISES:
        1. ValueError - When columns of different GeoDataFrames are given in a
           ColumnExpression, or when columns are given in string format and
           not as a ColumnExpression.
        2. TeradataMlException - If types of input parameters are mismatched.
        3. TypeError

    EXAMPLES:
        >>> from teradataml import load_example_data, GeoDataFrame
        >>> load_example_data("geodataframe","sample_shapes")
        >>> df = GeoDataFrame("sample_shapes").select(["skey", "points"])
        >>> # Print GeoDataFrame.
        >>> df
              points
        skey
        1006  POINT (235.52 54.546 7.4564)
        1001  POINT (10 20)
        1002  POINT (1 3)
        1010  MULTIPOINT (10.345 20.32 30.6,40.234 50.23 60.24,70.234 80.56 80.234)
        1004  POINT (10 20 30)
        1003  POINT (235.52 54.546)
        1008  MULTIPOINT (1.65 1.76,1.23 3.76,6.23 3.78,10.76 5.9,20.32 1.231)
        1005  POINT (1 3 5)
        1007  MULTIPOINT (1 1,1 3,6 3,10 5,20 1)
        1009  MULTIPOINT (10 20 30,40 50 60,70 80 80)

        >>> # Example 1: Sample with only the n argument.
            # Randomly samples 2 rows from the teradataml GeoDataFrame.
            # As there is only 1 sample, 'sampleid' is 1.
        >>> df.sample(n = 2)
              points                                                            sampleid
        skey
        1008  MULTIPOINT (1.65 1.76,1.23 3.76,6.23 3.78,10.76 5.9,20.32 1.231)  1
        1001  POINT (10 20)                                                     1

        >>> # Example 2: Sample with multiple sample values for n.
            # Creates 2 samples with 2 rows and 1 row respectively.
            # There are 2 values (1, 2) for 'sampleid', one for each sample.
        >>> df.sample(n = [2, 1])
              points                                                            sampleid
        skey
        1003  POINT (235.52 54.546)                                             1
        1008  MULTIPOINT (1.65 1.76,1.23 3.76,6.23 3.78,10.76 5.9,20.32 1.231)  2
        1001  POINT (10 20)                                                     1

        >>> # Example 3: Sample with only the frac parameter.
            # Randomly samples 20% of the total rows present in the teradataml GeoDataFrame.
        >>> df.sample(frac = 0.2)
              points            sampleid
        skey
        1004  POINT (10 20 30)  1
        1001  POINT (10 20)     1

        >>> # Example 4: Sample with multiple sample values for frac.
            # Creates 2 samples with 40% and 20% of the total rows in the
            # teradataml GeoDataFrame respectively.
        >>> df.sample(frac = [0.4, 0.2])
              points                                                            sampleid
        skey
        1001  POINT (10 20)                                                     1
        1004  POINT (10 20 30)                                                  1
        1003  POINT (235.52 54.546)                                             2
        1008  MULTIPOINT (1.65 1.76,1.23 3.76,6.23 3.78,10.76 5.9,20.32 1.231)  1
        1006  POINT (235.52 54.546 7.4564)                                      2
        1009  MULTIPOINT (10 20 30,40 50 60,70 80 80)                           1

        >>> # Example 5: Sample with n, replace and randomize.
            # Creates 2 samples with 2 rows and 1 row respectively, with possible
            # redundant sampling as replace is True, and also selects rows from
            # different AMPs as randomize is True.
        >>> df.sample(n = [2, 1], replace = True, randomize = True)
              points                                   sampleid
        skey
        1005  POINT (1 3 5)                            2
        1009  MULTIPOINT (10 20 30,40 50 60,70 80 80)  1
        1009  MULTIPOINT (10 20 30,40 50 60,70 80 80)  1

        >>> # Example 6: Sample with frac, replace and randomize.
            # Creates 2 samples with 40% and 20% of the total rows in the teradataml
            # GeoDataFrame respectively, with possible redundant sampling, and also
            # selects rows from different AMPs.
        >>> df.sample(frac = [0.4, 0.2], replace = True, randomize = True)
              points                              sampleid
        skey
        1002  POINT (1 3)                         1
        1004  POINT (10 20 30)                    1
        1004  POINT (10 20 30)                    2
        1002  POINT (1 3)                         1
        1005  POINT (1 3 5)                       2
        1007  MULTIPOINT (1 1,1 3,6 3,10 5,20 1)  1

        >>> # Example 7: Sample with case_when_then.
            # Creates 2 samples with 1 row and 2 rows respectively from rows which
            # satisfy df.skey < 1004, and 1 sample with 25% of the rows which
            # satisfy df.skey >= 1004.
        >>> df.sample(case_when_then={df.skey < 1004 : [1, 2], df.skey >= 1004 : 0.25})
              points                                                                 sampleid
        skey
        1010  MULTIPOINT (10.345 20.32 30.6,40.234 50.23 60.24,70.234 80.56 80.234)  3
        1003  POINT (235.52 54.546)                                                  1
        1002  POINT (1 3)                                                            1
        1001  POINT (10 20)                                                          1
        1009  MULTIPOINT (10 20 30,40 50 60,70 80 80)                                3

        >>> # Example 8: Sample with case_when_then, replace and randomize.
            # Creates 2 samples with 1 row and 2 rows respectively from rows which
            # satisfy df.skey < 1004, and 1 sample with 25% of the rows which
            # satisfy df.skey >= 1004, selecting rows from different AMPs with
            # replacement.
        >>> df.sample(replace = True, randomize = True, case_when_then={df.skey < 1004 : [1, 2],
        ...                                                             df.skey >= 1004 : 0.25})
              points                              sampleid
        skey
        1001  POINT (10 20)                       1
        1001  POINT (10 20)                       2
        1001  POINT (10 20)                       2
        1001  POINT (10 20)                       2
        1002  POINT (1 3)                         1
        1002  POINT (1 3)                         2
        1002  POINT (1 3)                         2
        1001  POINT (10 20)                       2
        1001  POINT (10 20)                       1
        1007  MULTIPOINT (1 1,1 3,6 3,10 5,20 1)  3

        >>> # Example 9: Sample with case_when_then and case_else.
            # Creates 4 samples: 2 samples with 1 row and 3 rows from rows which
            # satisfy df.skey < 1004, and 2 samples with 25% and 50% of the rows
            # which do not meet the condition df.skey < 1004.
        >>> df.sample(case_when_then = {df.skey < 1004 : [1, 3]}, case_else = [0.25, 0.5])
              points                                                                 sampleid
        skey
        1005  POINT (1 3 5)                                                          3
        1010  MULTIPOINT (10.345 20.32 30.6,40.234 50.23 60.24,70.234 80.56 80.234)  4
        1008  MULTIPOINT (1.65 1.76,1.23 3.76,6.23 3.78,10.76 5.9,20.32 1.231)       3
        1004  POINT (10 20 30)                                                       4
        1003  POINT (235.52 54.546)                                                  1
        1002  POINT (1 3)                                                            1
        1001  POINT (10 20)                                                          1
        1006  POINT (235.52 54.546 7.4564)                                           4
        1007  MULTIPOINT (1 1,1 3,6 3,10 5,20 1)                                     4

        >>> # Example 10: Sample with case_when_then, case_else, replace and randomize.
            # Creates 4 samples: 2 samples with 1 row and 3 rows from rows which
            # satisfy df.skey < 1004, and 2 samples with 25% and 50% of the rows
            # which do not meet the condition df.skey < 1004, with possible
            # redundant sampling, and also selects rows from different AMPs.
        >>> df.sample(case_when_then = {df.skey < 1004 : [1, 3]}, replace = True,
        ...           randomize = True, case_else = [0.25, 0.5])
              points                                                                 sampleid
        skey
        1009  MULTIPOINT (10 20 30,40 50 60,70 80 80)                                4
        1010  MULTIPOINT (10.345 20.32 30.6,40.234 50.23 60.24,70.234 80.56 80.234)  4
        1008  MULTIPOINT (1.65 1.76,1.23 3.76,6.23 3.78,10.76 5.9,20.32 1.231)       4
        1002  POINT (1 3)                                                            1
        1002  POINT (1 3)                                                            1
        1002  POINT (1 3)                                                            2
        1002  POINT (1 3)                                                            2
        1002  POINT (1 3)                                                            2
        1002  POINT (1 3)                                                            2
        1002  POINT (1 3)                                                            1
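
The argument rules listed under PARAMETERS (exactly one of 'n', 'frac' and 'case_when_then'; at most 16 samples per list; fractions in (0,1] that sum to at most 1; 'case_else' only alongside 'case_when_then') can be sketched as a standalone pre-check. This is an illustration only and not teradataml's own validation: the names check_sample_args, _check_sizes, MAX_SAMPLES and TOLERANCE are hypothetical, the exception types are assumptions, and the dictionary keys are treated as opaque rather than verified to be ColumnExpressions.

```python
MAX_SAMPLES = 16   # limit quoted in the parameter notes
TOLERANCE = 1e-9   # assumption: slack for floating-point rounding in sums


def _check_sizes(name, value, kind=None):
    """Validate one size spec and return it as a list.

    kind is "count" (ints only), "fraction" (floats only) or None (either).
    """
    sizes = list(value) if isinstance(value, (list, tuple)) else [value]
    if not 1 <= len(sizes) <= MAX_SAMPLES:
        raise ValueError(f"{name}: 1 to {MAX_SAMPLES} samples may be requested")
    # bool is a subclass of int, so exclude it explicitly.
    is_count = all(isinstance(s, int) and not isinstance(s, bool) for s in sizes)
    is_fraction = all(isinstance(s, float) for s in sizes)
    if is_count and kind in (None, "count"):
        if any(s <= 0 for s in sizes):
            raise ValueError(f"{name}: row counts must be positive")
    elif is_fraction and kind in (None, "fraction"):
        if any(not 0 < s <= 1 for s in sizes):
            raise ValueError(f"{name}: fractions must lie in (0, 1]")
        if sum(sizes) > 1 + TOLERANCE:
            raise ValueError(f"{name}: fractions must not sum to more than 1")
    else:
        raise TypeError(f"{name}: unexpected value types in {value!r}")
    return sizes


def check_sample_args(n=None, frac=None, case_when_then=None, case_else=None):
    """Return the normalised arguments, or raise if a documented rule is broken."""
    given = {"n": n, "frac": frac, "case_when_then": case_when_then}
    used = [name for name, value in given.items() if value is not None]
    if len(used) != 1:
        raise ValueError("use exactly one of 'n', 'frac' and 'case_when_then'")
    if case_else is not None and case_when_then is None:
        raise ValueError(
            "'case_else' can only be used when 'case_when_then' is specified"
        )

    checked = {}
    if n is not None:
        checked["n"] = _check_sizes("n", n, kind="count")
    elif frac is not None:
        checked["frac"] = _check_sizes("frac", frac, kind="fraction")
    else:
        # Keys are not inspected here; teradataml expects ColumnExpressions.
        checked["case_when_then"] = {
            cond: _check_sizes("case_when_then", sizes)
            for cond, sizes in case_when_then.items()
        }
        if case_else is not None:
            checked["case_else"] = _check_sizes("case_else", case_else)
    return checked
```

For instance, check_sample_args(frac=[0.4, 0.2]) passes, while check_sample_args(frac=[0.7, 0.6]) raises ValueError because the fractions sum to more than 1, and check_sample_args(n=2, case_else=10) raises ValueError because 'case_else' is given without 'case_when_then'.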