Teradata® Package for Python Function Reference
- Product
- Teradata Package for Python
- Release Number
- 17.10
- Published
- April 2022
- Language
- English (United States)
- Last Update
- 2022-08-19
- Lifecycle
- Previous
- Product Category
- Teradata Vantage
- teradataml.dataframe.dataframe.DataFrameGroupByTime.percentile = percentile(self, percentile, distinct=False, interpolation='LINEAR')
- DESCRIPTION:
Function returns the value that represents the desired percentile for each group.
The result value is determined by the desired index (di) in an ordered list of values.
The desired index is computed using the following equation:
di = (number of values in group - 1) * percentile / 100
When di is a whole number, the value at that index is the returned result.
di can also fall between two data points, i and j, where i < j. In that case, the result
is interpolated according to the value specified in the interpolation argument
(see the sketch after the notes below).
Note:
1. This function is valid only on columns with numeric types.
2. Null values are not included in the result computation.
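As an illustration of the equation above, here is a small, pure-Python sketch of the
index computation (not the teradataml implementation; the actual aggregation runs in
Vantage). It assumes the percentile in the equation is expressed on a 0-100 scale,
i.e., the 0-1 value passed to this function multiplied by 100; the helper name and
sample values are invented for the example.

def desired_index(values, percentile):
    # percentile expressed on a 0-100 scale; None entries (nulls) are excluded,
    # in line with note 2 above.
    ordered = sorted(v for v in values if v is not None)
    return (len(ordered) - 1) * percentile / 100

desired_index([10, 10, 55, 99, 100], 25)   # 1.0 -> whole number, result is ordered[1] = 10
desired_index([70, 71, 72, 77, 79], 60)    # 2.4 -> lies between ordered[2] and ordered[3]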
PARAMETERS:
percentile:
Required Argument.
Specifies the desired percentile value to calculate.
It should be between 0 and 1, both inclusive.
Types: int or float
distinct:
Optional Argument.
Specifies whether to exclude duplicate values while calculating
the percentile value.
Default Values: False
Types: bool
interpolation:
Optional Argument.
Specifies the interpolation type used to compute the result value when the
desired result lies between two data points, i and j, where i < j. In this case,
the result is interpolated according to the permitted value specified
(see the sketch after this parameter's description).
Permitted Values: "LINEAR", "LOW", "HIGH", "NEAREST", "MIDPOINT"
* LINEAR: Linear interpolation.
The result value is computed using the following equation:
result = i + (j - i) * (di MOD 1)
Specify by passing "LINEAR" as string to this parameter.
* LOW: Low value interpolation.
The result value is equal to i.
Specify by passing "LOW" as string to this parameter.
* HIGH: High value interpolation.
The result value is equal to j.
Specify by passing "HIGH" as string to this parameter.
* NEAREST: Nearest value interpolation.
The result value is i if di MOD 1 <= 0.5; otherwise, it is j.
Specify by passing "NEAREST" as string to this parameter.
* MIDPOINT: Midpoint interpolation.
The result value is equal to (i+j)/2.
Specify by passing "MIDPOINT" as string to this parameter.
Default Values: "LINEAR"
Types: str
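For illustration, a minimal pure-Python sketch of the five interpolation choices,
given the bracketing values i and j and the fractional part of di (not the
teradataml implementation; the function and its arguments are invented for the
example):

def interpolate(i, j, frac, interpolation="LINEAR"):
    # i < j bracket the desired index; frac is the fractional part of di.
    return {
        "LINEAR":   i + (j - i) * frac,
        "LOW":      i,
        "HIGH":     j,
        "NEAREST":  i if frac <= 0.5 else j,
        "MIDPOINT": (i + j) / 2,
    }[interpolation]

[interpolate(72, 77, 0.4, m) for m in ("LINEAR", "LOW", "HIGH", "NEAREST", "MIDPOINT")]
# [74.0, 72, 77, 72, 74.5]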
RETURNS:
teradataml DataFrame.
RAISES:
TypeError - If an incorrect type of value is passed to an input argument.
ValueError - If an invalid value is passed to an argument.
TeradataMLException - TDMLDF_AGGREGATE_FAILED - If the percentile() operation fails to
generate the column-wise percentile values.
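For instance, based on the percentile argument description above, a value outside
the range 0 to 1 is expected to raise a ValueError. A hedged illustration
(admissions_train refers to the DataFrame created in the examples below; the exact
exception message is not reproduced here):

# Expected to raise ValueError because percentile must lie between 0 and 1.
try:
    admissions_train.groupby("admitted").percentile(1.5)
except ValueError as err:
    print(err)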
EXAMPLES:
>>> # Load the example datasets.
... load_example_data("dataframe", ["ocean_buoys", "ocean_buoys_seq", "ocean_buoys_nonpti", "admissions_train"])
>>>
#
# Example 1: Executing percentile() function on DataFrame created on non-sequenced PTI table.
# Calculate the 25th percentile value for all numeric columns using default
# values, i.e., consider all rows (duplicate rows as well) and linear
# interpolation while computing the percentile value.
#
>>> # Create the required DataFrames.
... # DataFrame on non-sequenced PTI table
... ocean_buoys = DataFrame("ocean_buoys")
>>> # Check DataFrame columns and let's peek at the data
... ocean_buoys.columns
['TD_TIMECODE', 'buoyid', 'salinity', 'temperature']
>>> ocean_buoys.head()
TD_TIMECODE salinity temperature
buoyid
0 2014-01-06 08:10:00.000000 55 100.0
0 2014-01-06 08:08:59.999999 55 NaN
1 2014-01-06 09:01:25.122200 55 77.0
1 2014-01-06 09:03:25.122200 55 79.0
1 2014-01-06 09:01:25.122200 55 70.0
1 2014-01-06 09:02:25.122200 55 71.0
1 2014-01-06 09:03:25.122200 55 72.0
0 2014-01-06 08:09:59.999999 55 99.0
0 2014-01-06 08:00:00.000000 55 10.0
0 2014-01-06 08:10:00.000000 55 10.0
>>>
>>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="10m", value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_grpby1.percentile(0.25).sort(["TIMECODE_RANGE", "buoyid"])
TIMECODE_RANGE GROUP BY TIME(MINUTES(10)) buoyid percentile_salinity percentile_temperature
0 ('2014-01-06 08:00:00.000000+00:00', '2014-01-... 106033 0 55.0 32.25
1 ('2014-01-06 08:10:00.000000+00:00', '2014-01-... 106034 0 55.0 32.50
2 ('2014-01-06 09:00:00.000000+00:00', '2014-01-... 106039 1 55.0 71.25
3 ('2014-01-06 10:00:00.000000+00:00', '2014-01-... 106045 44 55.0 43.00
4 ('2014-01-06 10:10:00.000000+00:00', '2014-01-... 106046 44 55.0 43.00
5 ('2014-01-06 10:20:00.000000+00:00', '2014-01-... 106047 44 NaN NaN
6 ('2014-01-06 10:30:00.000000+00:00', '2014-01-... 106048 44 55.0 43.00
7 ('2014-01-06 10:40:00.000000+00:00', '2014-01-... 106049 44 NaN NaN
8 ('2014-01-06 10:50:00.000000+00:00', '2014-01-... 106050 44 55.0 43.00
9 ('2014-01-06 21:00:00.000000+00:00', '2014-01-... 106111 2 55.0 80.50
>>>
#
# Example 2: Executing percentile() function on ocean_buoys_seq DataFrame created on
# sequenced PTI table.
# Calculate the 50th percentile value for all numeric columns.
# To calculate the percentile, consider all rows (duplicate rows as well) and
# use "MIDPOINT" interpolation while computing the percentile value.
#
>>> # DataFrame on sequenced PTI table
... ocean_buoys_seq = DataFrame("ocean_buoys_seq")
>>> # Check DataFrame columns and let's peek at the data
... ocean_buoys_seq.columns
['TD_TIMECODE', 'TD_SEQNO', 'buoyid', 'salinity', 'temperature', 'dates']
>>> ocean_buoys_seq.head()
TD_TIMECODE TD_SEQNO salinity temperature dates
buoyid
0 2014-01-06 08:00:00.000000 26 55 10.0 2016-02-26
0 2014-01-06 08:08:59.999999 18 55 NaN 2015-06-18
1 2014-01-06 09:02:25.122200 24 55 78.0 2015-12-24
1 2014-01-06 09:01:25.122200 23 55 77.0 2015-11-23
1 2014-01-06 09:02:25.122200 12 55 71.0 2014-12-12
1 2014-01-06 09:03:25.122200 13 55 72.0 2015-01-13
1 2014-01-06 09:01:25.122200 11 55 70.0 2014-11-11
0 2014-01-06 08:10:00.000000 19 55 10.0 2015-07-19
0 2014-01-06 08:09:59.999999 17 55 99.0 2015-05-17
0 2014-01-06 08:10:00.000000 27 55 100.0 2016-03-27
>>>
>>> ocean_buoys_seq_grpby1 = ocean_buoys_seq.groupby_time(timebucket_duration="1cy", value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_seq_grpby1.percentile(0.5, interpolation="MIDPOINT").sort(["TIMECODE_RANGE", "buoyid"])
TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(1)) buoyid percentile_TD_SEQNO percentile_salinity percentile_temperature
0 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 3 0 22.5 55.0 54.5
1 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 3 1 18.0 55.0 74.5
2 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 3 2 15.5 55.0 81.5
3 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 3 22 1.0 25.0 23.0
4 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 3 44 7.5 55.0 48.0
>>>
#
# Example 3: Executing percentile() function for all numeric columns in
# teradataml DataFrame created on NON-PTI table.
# Calculate the 75th percentile value, excluding duplicate rows and using
# "LOW" interpolation.
#
>>> # DataFrame on NON-PTI table
... ocean_buoys_nonpti = DataFrame("ocean_buoys_nonpti")
>>> # Check DataFrame columns and let's peek at the data
... ocean_buoys_nonpti.columns
['timecode', 'buoyid', 'salinity', 'temperature']
>>> ocean_buoys_nonpti.head()
buoyid salinity temperature
timecode
2014-01-06 08:09:59.999999 0 55 99.0
2014-01-06 08:10:00.000000 0 55 100.0
2014-01-06 09:01:25.122200 1 55 70.0
2014-01-06 09:01:25.122200 1 55 77.0
2014-01-06 09:02:25.122200 1 55 71.0
2014-01-06 09:03:25.122200 1 55 72.0
2014-01-06 09:02:25.122200 1 55 78.0
2014-01-06 08:10:00.000000 0 55 10.0
2014-01-06 08:08:59.999999 0 55 NaN
2014-01-06 08:00:00.000000 0 55 10.0
>>>
>>> ocean_buoys_nonpti_grpby1 = ocean_buoys_nonpti.groupby_time(timebucket_duration="1cy", value_expression="buoyid", timecode_column="timecode", fill="NULLS")
>>> ocean_buoys_nonpti_grpby1.percentile(0.75, distinct=True, interpolation="low").sort(["TIMECODE_RANGE", "buoyid"])
TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(1)) buoyid percentile_salinity percentile_temperature
0 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 45 0 55.0 99.0
1 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 45 1 55.0 77.0
2 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 45 2 55.0 81.0
3 ('2014-01-01 00:00:00.000000-00:00', '2015-01-... 45 44 55.0 55.0
>>>
#
# Example 4: Executing percentile() function on DataFrame created on a regular table.
# Calculate the 25th percentile value for all numeric columns using default
# values, i.e., consider all rows (duplicate rows as well) and linear
# interpolation while computing the percentile value.
#
>>> # Create the required DataFrame on non PTI table.
... admissions_train = DataFrame("admissions_train")
>>> # Check DataFrame columns and let's peek at the data
... admissions_train.columns
['id', 'masters', 'gpa', 'stats', 'programming', 'admitted']
>>> admissions_train.head()
masters gpa stats programming admitted
id
3 no 3.70 Novice Beginner 1
5 no 3.44 Novice Novice 0
6 yes 3.50 Beginner Advanced 1
7 yes 2.33 Novice Novice 1
9 no 3.82 Advanced Advanced 1
10 no 3.71 Advanced Advanced 1
8 no 3.60 Beginner Advanced 1
4 yes 3.50 Beginner Novice 1
2 yes 3.76 Beginner Beginner 0
1 yes 3.95 Beginner Beginner 0
>>> df = admissions_train.groupby("admitted").percentile(0.25)
>>> df
admitted percentile_id percentile_gpa
0 0 15 3.4525
1 1 10 3.5050
>>>
#
# Example 5: Executing percentile() function on DataFrame created on a regular table.
# Calculate the 35th percentile value for all numeric columns. Consider all
# rows (duplicate rows as well) and use no interpolation (interpolation=None)
# while computing the percentile value.
#
>>> # Create the required DataFrame on non PTI table.
... admissions_train = DataFrame("admissions_train")
>>> # Check DataFrame columns and let's peek at the data
... admissions_train.columns
['id', 'masters', 'gpa', 'stats', 'programming', 'admitted']
>>> admissions_train.head()
masters gpa stats programming admitted
id
3 no 3.70 Novice Beginner 1
5 no 3.44 Novice Novice 0
6 yes 3.50 Beginner Advanced 1
7 yes 2.33 Novice Novice 1
9 no 3.82 Advanced Advanced 1
10 no 3.71 Advanced Advanced 1
8 no 3.60 Beginner Advanced 1
4 yes 3.50 Beginner Novice 1
2 yes 3.76 Beginner Beginner 0
1 yes 3.95 Beginner Beginner 0
>>> df = admissions_train.groupby("admitted").percentile(0.35, interpolation=None)
>>> df
admitted percentile_id percentile_gpa
0 0 19 3.46
1 1 13 3.57
>>>