Teradata Package for Python Function Reference | 20.00 - std - Teradata Package for Python - Look here for syntax, methods and examples for the functions included in the Teradata Package for Python.
Teradata® Package for Python Function Reference - 20.00
- Deployment: VantageCloud, VantageCore
- Edition: Enterprise, IntelliFlex, VMware
- Product: Teradata Package for Python
- Release Number: 20.00.00.03
- Published: December 2024
- ft:locale: en-US
- ft:lastEdition: 2024-12-19
- dita:id: TeradataPython_FxRef_Enterprise_2000
- Product Category: Teradata Vantage
teradataml.dataframe.dataframe.DataFrame.std = std(self, distinct=False, population=False)
DESCRIPTION:
Returns the column-wise sample or population standard deviation of the
DataFrame. The standard deviation is the square root of the variance, which
is the second central moment of a distribution.
* For a sample, it is a measure of dispersion from the mean of that sample.
* For a population, it is a measure of dispersion from the mean of that population.
The computation for the sample standard deviation is more conservative than
that for the population standard deviation, to minimize the effect of
outliers on the computed value.
Note:
1. When there are fewer than two non-null data points in the sample used
for the computation, std returns None.
2. Null values are not included in the computation.
3. If the data represents only a sample of the entire population for the
columns, Teradata recommends calculating the sample standard deviation;
otherwise, calculate the population standard deviation.
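The notes above can be illustrated outside the database with a minimal plain-Python sketch. It is not the teradataml implementation: the helper name std_like is hypothetical, nulls are modeled as None, and the behavior of the population case with fewer than two points is an assumption that the notes do not cover.

```python
import statistics


def std_like(values, population=False):
    """Illustrative stand-in for the documented behavior (hypothetical helper)."""
    # Note 2: null values are left out of the computation.
    data = [v for v in values if v is not None]
    if population:
        # Assumption: the population formula only needs one non-null point.
        return statistics.pstdev(data) if data else None
    # Note 1: a sample needs at least two non-null points.
    if len(data) < 2:
        return None
    return statistics.stdev(data)


# employee_no values from the employee_info example in this section.
employee_no = [101, 100, 112]
print(round(std_like(employee_no), 6))                   # 6.658328
print(round(std_like(employee_no, population=True), 6))  # 5.436502
print(std_like([None, 42]))                              # None
```

The sample form divides the sum of squared deviations by n - 1 and the population form divides by n, which is why the sample value is the larger of the two.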
PARAMETERS:
distinct:
Optional Argument.
Specifies whether to exclude duplicate values while calculating
the standard deviation.
Default Value: False
Types: bool
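As a rough illustration of what excluding duplicates means, the following plain-Python sketch removes repeated values before computing the sample standard deviation. The helper sample_std is hypothetical and only mimics the documented behavior; the input values are the buoyid 0 rows shown in the ocean_buoys example later in this section, with the NaN temperature modeled as None.

```python
import statistics


def sample_std(values, distinct=False):
    """Hypothetical helper: sample standard deviation with optional de-duplication."""
    data = [v for v in values if v is not None]  # nulls are ignored
    if distinct:
        data = sorted(set(data))                 # keep each value once
    if len(data) < 2:
        return None                              # fewer than two points
    return statistics.stdev(data)


temperature = [100.0, None, 99.0, 10.0, 10.0]
salinity = [55, 55, 55, 55, 55]

print(round(sample_std(temperature), 6))                 # 51.674462
print(round(sample_std(temperature, distinct=True), 6))  # 51.675268
print(sample_std(salinity))                              # 0.0
print(sample_std(salinity, distinct=True))               # None
```

A column that holds a single repeated value collapses to one data point once duplicates are removed, which is consistent with the None shown for std_salinity in the distinct time series example.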
population:
Optional Argument.
Specifies whether to calculate the standard deviation on the entire population.
Set this argument to True only when the data points represent the complete
population. If the data represents only a sample of the entire population for the
columns, set this argument to False to compute the sample standard deviation.
The sample and population standard deviations approach the same value as the
sample size increases, but you should still use the more conservative sample
calculation unless you are certain that the data constitutes the entire
population for the columns.
Default Value: False
Types: bool
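The convergence described for the population argument follows from the two formulas: for the same n data points, the sample value equals the population value multiplied by sqrt(n / (n - 1)). The sketch below, which assumes nothing about teradataml and uses a hypothetical helper name, prints that factor for a few sample sizes.

```python
import math


def sample_to_population_ratio(n):
    """Ratio of sample to population standard deviation for n data points."""
    if n < 2:
        raise ValueError("the sample standard deviation needs at least two points")
    return math.sqrt(n / (n - 1))


# The factor shrinks toward 1 as the number of data points grows.
for n in (2, 3, 10, 100, 10_000):
    print(n, round(sample_to_population_ratio(n), 6))
```

For n = 3 the factor is sqrt(1.5), which matches the ratio between the std_employee_no values 6.658328 and 5.436502 in the examples in this section.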
RETURNS:
teradataml DataFrame object with std() operation performed.
RAISES:
1. EXECUTION_FAILED - If std() operation fails to
generate the column-wise standard deviation of the
dataframe.
Possible error message:
Failed to perform 'std'. (Followed by error message)
2. TDMLDF_AGGREGATE_COMBINED_ERR - If the std() operation
doesn't support all the columns in the dataframe.
Possible error message:
No results. Below is/are the error message(s):
All selected columns [(col2 - PERIOD_TIME), (col3 -
BLOB)] is/are unsupported for 'std' operation.
EXAMPLES:
# Load the data to run the example.
>>> from teradataml.data.load_example_data import load_example_data
>>> load_example_data("dataframe", ["employee_info"])
# Create teradataml dataframe.
>>> from teradataml import DataFrame
>>> df1 = DataFrame("employee_info")
>>> print(df1)
            first_name marks   dob joined_date
employee_no
101              abcde  None  None    02/12/05
100               abcd  None  None        None
112               None  None  None    18/12/05
>>>
# Select only subset of columns from the DataFrame.
>>> df2 = df1.select(['employee_no', 'first_name', 'marks', 'joined_date'])
# Prints sample standard deviation of each column (with supported data types).
>>> df2.std()
   std_employee_no std_marks std_joined_date
0         6.658328      None        82/03/09
>>>
# Prints population standard deviation of each column (with supported data types).
>>> df2.std(population=True)
   std_employee_no std_marks std_joined_date
0         5.436502      None        58/02/28
>>>
#
# Using std() as Time Series Aggregate.
#
>>> # Load the example datasets.
... load_example_data("dataframe", ["ocean_buoys"])
>>>
#
# Time Series Aggregate Example 1: Executing std() function on DataFrame created on
# non-sequenced PTI table. We will consider all rows for the
# columns while calculating the standard deviation.
#
>>> # Create the required DataFrames.
... # DataFrame on non-sequenced PTI table
... ocean_buoys = DataFrame("ocean_buoys")
>>> # Check DataFrame columns and let's peek at the data
... ocean_buoys.columns
['buoyid', 'TD_TIMECODE', 'temperature', 'salinity']
>>> ocean_buoys.head()
                       TD_TIMECODE  temperature  salinity
buoyid
0       2014-01-06 08:10:00.000000        100.0        55
0       2014-01-06 08:08:59.999999          NaN        55
1       2014-01-06 09:01:25.122200         77.0        55
1       2014-01-06 09:03:25.122200         79.0        55
1       2014-01-06 09:01:25.122200         70.0        55
1       2014-01-06 09:02:25.122200         71.0        55
1       2014-01-06 09:03:25.122200         72.0        55
0       2014-01-06 08:09:59.999999         99.0        55
0       2014-01-06 08:00:00.000000         10.0        55
0       2014-01-06 08:10:00.000000         10.0        55
# To use std() as Time Series Aggregate we must run groupby_time() first, followed by std().
>>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
... value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_grpby1.std().sort(["TIMECODE_RANGE", "buoyid"])
                                      TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  std_salinity  std_temperature
0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0           0.0        51.674462
1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1           0.0         3.937004
2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2           0.0         1.000000
3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44           0.0         5.765725
>>>
#
# Time Series Aggregate Example 2: Executing std() function on DataFrame created on
# non-sequenced PTI table. We will consider DISTINCT rows for the
# columns while calculating the standard deviation.
#
# To use std() as Time Series Aggregate we must run groupby_time() first, followed by std().
>>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
... value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_grpby1.std(distinct = True).sort(["TIMECODE_RANGE", "buoyid"])
                                      TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  std_salinity  std_temperature
0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0          None        51.675268
1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1          None         3.937004
2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2          None         1.000000
3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44          None         5.263079
>>>
#
# Time Series Aggregate Example 3: Executing std() function on DataFrame created on
# non-sequenced PTI table. We shall calculate the
# standard deviation on entire population, with
# all non-null data points considered for calculations.
#
# To use std() as Time Series Aggregate we must run groupby_time() first, followed by std().
# To calculate population standard deviation we must set population=True.
#
>>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
... value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_grpby1.std(population=True).sort(["TIMECODE_RANGE", "buoyid"])
                                      TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  std_salinity  std_temperature
0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0           0.0        44.751397
1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1           0.0         3.593976
2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2           0.0         0.816497
3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44           0.0         5.539530
>>>