Teradata® Package for Python Function Reference

Product

Teradata Package for Python

Release Number

17.10

Published

April 2022

Language

English (United States)

Last Update

2022-08-19

Product Category

Teradata Vantage

teradataml.dataframe.dataframe.DataFrameGroupByTime.std = std(self, distinct=False, population=False)

DESCRIPTION:
    Returns the column-wise sample or population standard deviation of the
    dataframe. The standard deviation is the square root of the variance,
    which is the second central moment of a distribution.
        * For a sample, it is a measure of dispersion from the mean of
          that sample.
        * For a population, it is a measure of dispersion from the mean of
          that population.
    The computation is more conservative for the population standard deviation
    to minimize the effect of outliers on the computed value.
    Note:
        1. When there are fewer than two non-null data points in the sample
           used for the computation, std returns None.
        2. Null values are not included in the result computation.
        3. If the data represents only a sample of the entire population for
           the columns, Teradata recommends calculating the sample standard
           deviation; otherwise, calculate the population standard deviation.

PARAMETERS:
    distinct:
        Optional Argument.
        Specifies whether to exclude duplicate values while calculating
        the standard deviation.
        Default Value: False
        Types: bool

    population:
        Optional Argument.
        Specifies whether to calculate the standard deviation on the entire
        population. Set this argument to True only when the data points
        represent the complete population. If your data represents only a
        sample of the entire population for the columns, set this argument
        to False, which computes the sample standard deviation. As the sample
        size increases, the sample and population standard deviations
        approach the same value, but you should still use the more
        conservative sample standard deviation unless you are absolutely
        certain that your data constitutes the entire population for the
        columns.
        Default Value: False
        Types: bool

RETURNS:
    teradataml DataFrame object with std() operation performed.

RAISES:
    1. EXECUTION_FAILED - If the std() operation fails to generate the
       column-wise standard deviation of the dataframe.
       Possible error message:
       Failed to perform 'std'. (Followed by error message)

    2. TDMLDF_AGGREGATE_COMBINED_ERR - If the std() operation doesn't
       support all the columns in the dataframe.
       Possible error message:
       No results. Below is/are the error message(s):
       All selected columns [(col2 - PERIOD_TIME), (col3 - BLOB)] is/are
       unsupported for 'std' operation.

EXAMPLES :
    # Load the data to run the example.
    >>> from teradataml.dataframe.dataframe import DataFrame
    >>> from teradataml.data.load_example_data import load_example_data
    >>> load_example_data("dataframe", ["employee_info"])

    # Create teradataml dataframe.
    >>> df1 = DataFrame("employee_info")
    >>> print(df1)
                first_name marks   dob joined_date
    employee_no
    101              abcde  None  None    02/12/05
    100               abcd  None  None        None
    112               None  None  None    18/12/05
    >>>

    # Select only a subset of columns from the DataFrame.
    >>> df2 = df1.select(['employee_no', 'first_name', 'marks', 'joined_date'])

    # Print the sample standard deviation of each column (with supported data types).
    >>> df2.std()
       std_employee_no std_marks std_joined_date
    0         6.658328      None        82/03/09
    >>>

    # Print the population standard deviation of each column (with supported data types).
    >>> df2.std(population=True)
       std_employee_no std_marks std_joined_date
    0         5.436502      None        58/02/28
    >>>

    #
    # Using std() as Time Series Aggregate.
    #
    >>> # Load the example datasets.
    ... load_example_data("dataframe", ["ocean_buoys"])
    >>>

    #
    # Time Series Aggregate Example 1: Executing std() function on DataFrame created on
    #                                  non-sequenced PTI table. We will consider all rows for the
    #                                  columns while calculating the standard deviation.
    #
    >>> # Create the required DataFrames.
    ... # DataFrame on non-sequenced PTI table
    ... ocean_buoys = DataFrame("ocean_buoys")
    >>> # Check DataFrame columns and let's peek at the data
    ... ocean_buoys.columns
    ['buoyid', 'TD_TIMECODE', 'temperature', 'salinity']
    >>> ocean_buoys.head()
                           TD_TIMECODE  temperature  salinity
    buoyid
    0       2014-01-06 08:10:00.000000        100.0        55
    0       2014-01-06 08:08:59.999999          NaN        55
    1       2014-01-06 09:01:25.122200         77.0        55
    1       2014-01-06 09:03:25.122200         79.0        55
    1       2014-01-06 09:01:25.122200         70.0        55
    1       2014-01-06 09:02:25.122200         71.0        55
    1       2014-01-06 09:03:25.122200         72.0        55
    0       2014-01-06 08:09:59.999999         99.0        55
    0       2014-01-06 08:00:00.000000         10.0        55
    0       2014-01-06 08:10:00.000000         10.0        55

    # To use std() as Time Series Aggregate we must run groupby_time() first, followed by std().
    >>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
    ...                                               value_expression="buoyid", fill="NULLS")
    >>> ocean_buoys_grpby1.std().sort(["TIMECODE_RANGE", "buoyid"])
                                          TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  std_salinity  std_temperature
    0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0           0.0        51.674462
    1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1           0.0         3.937004
    2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2           0.0         1.000000
    3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44           0.0         5.765725
    >>>

    #
    # Time Series Aggregate Example 2: Executing std() function on DataFrame created on
    #                                  non-sequenced PTI table. We will consider DISTINCT rows for the
    #                                  columns while calculating the standard deviation.
    #
    # To use std() as Time Series Aggregate we must run groupby_time() first, followed by std().
    >>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
    ...                                               value_expression="buoyid", fill="NULLS")
    >>> ocean_buoys_grpby1.std(distinct = True).sort(["TIMECODE_RANGE", "buoyid"])
                                          TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  std_salinity  std_temperature
    0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0          None        51.675268
    1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1          None         3.937004
    2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2          None         1.000000
    3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44          None         5.263079
    >>>

    #
    # Time Series Aggregate Example 3: Executing std() function on DataFrame created on
    #                                  non-sequenced PTI table. We shall calculate the
    #                                  standard deviation on entire population, with
    #                                  all non-null data points considered for calculations.
    #
    # To use std() as Time Series Aggregate we must run groupby_time() first, followed by std().
    # To calculate population standard deviation we must set population=True.
    #
    >>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
    ...                                               value_expression="buoyid", fill="NULLS")
    >>> ocean_buoys_grpby1.std(population=True).sort(["TIMECODE_RANGE", "buoyid"])
                                          TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  std_salinity  std_temperature
    0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0           0.0        44.751397
    1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1           0.0         3.593976
    2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2           0.0         0.816497
    3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44           0.0         5.539530
    >>>
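
As a rough local check of the notes and parameters described above, the following sketch reproduces the std_temperature values reported for buoyid 0 in the three time series examples. It uses Python's standard statistics module, not teradataml, and needs no database connection. The helper local_std is hypothetical: it only mimics the documented handling of nulls, distinct and population. It assumes that the five head() rows shown for buoyid 0 are all of that buoy's rows, and that a population with no data points yields None.

```python
import statistics

# Temperature readings for buoyid 0, copied from the ocean_buoys.head()
# output in the examples. None stands in for the NaN (null) reading.
# Assumption: these five rows are all the rows for buoyid 0.
buoy0_temperature = [100.0, None, 99.0, 10.0, 10.0]


def local_std(values, distinct=False, population=False):
    """Hypothetical local stand-in for std(); not part of teradataml."""
    # Null values are excluded from the computation.
    points = [v for v in values if v is not None]
    if distinct:
        # Drop duplicate values before computing.
        points = sorted(set(points))
    if population:
        # Assumption: a population with no data points yields None.
        return statistics.pstdev(points) if points else None
    # Fewer than two non-null data points: the sample result is None.
    if len(points) < 2:
        return None
    return statistics.stdev(points)


sample_std = local_std(buoy0_temperature)
distinct_std = local_std(buoy0_temperature, distinct=True)
population_std = local_std(buoy0_temperature, population=True)

print(round(sample_std, 6))      # 51.674462 (Example 1, buoyid 0)
print(round(distinct_std, 6))    # 51.675268 (Example 2, buoyid 0)
print(round(population_std, 6))  # 44.751397 (Example 3, buoyid 0)
```

The sample result divides the summed squared deviations by n - 1 and the population result divides by n, which is why the population value is the smaller of the two for the same data.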