Teradata® Package for Python Function Reference
- Product
- Teradata Package for Python
- Release Number
- 17.10
- Published
- April 2022
- Language
- English (United States)
- Last Update
- 2022-08-19
- lifecycle
- previous
- Product Category
- Teradata Vantage
- teradataml.dataframe.dataframe.DataFrame.var = var(self, distinct=False, population=False)
- DESCRIPTION:
Returns column-wise sample or population variance of the columns in a
dataframe.
* The variance of a population is a measure of dispersion from the
mean of that population.
* The variance of a sample is a measure of dispersion from the mean
of that sample. It is the square of the sample standard deviation.
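For reference, the two quantities described above are conventionally defined as shown below. These are the standard textbook formulas rather than text taken from the teradataml documentation. Here n is assumed to be the number of non-null values x_i in a column, \bar{x} their sample mean, and \mu the population mean.

```latex
\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
\qquad\text{(sample variance)}
\]
\[
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2
\qquad\text{(population variance)}
\]
```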
Note:
1. When there are fewer than two non-null data points in the sample used
for the computation, var() returns None.
2. Null values are excluded from the computation.
3. If the data represents only a sample of the entire population for the
columns, Teradata recommends calculating the sample variance;
otherwise, calculate the population variance.
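The notes above can be illustrated locally with a minimal pure-Python sketch. It does not call teradataml or a Vantage system. `variance_sketch` is a hypothetical helper written only for this illustration, and the input lists are copied from the 'employee_info' and 'sales' example tables used in the EXAMPLES section.

```python
def variance_sketch(values, population=False):
    """Hypothetical local helper (not a teradataml API)."""
    # Null (None) values are left out of the computation.
    data = [v for v in values if v is not None]
    n = len(data)
    # Fewer than two non-null points: no variance is reported.
    if n < 2:
        return None
    mean = sum(data) / n
    squared_dev = sum((v - mean) ** 2 for v in data)
    # Divide by n for a population, by n - 1 for a sample.
    return squared_dev / (n if population else n - 1)


marks = [None, None, None]             # 'marks' column of employee_info
jan = [50, None, 150, None, 150, 200]  # 'Jan' column of sales

print(variance_sketch(marks))                 # None
print(variance_sketch(jan))                   # about 3958.333
print(variance_sketch(jan, population=True))  # 2968.75
```

The sample result is larger than the population result because it divides the same sum of squared deviations by a smaller number.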
PARAMETERS:
distinct:
Optional Argument.
Specifies whether to exclude duplicate column values while calculating the
variance value.
Default Value: False
Types: bool
population:
Optional Argument.
Specifies whether to calculate the population variance instead of the
sample variance.
Set this argument to True only when the data points represent the complete
population. If your data represents only a sample of the entire population
for the columns, set this argument to False, which computes the sample
variance. Although the sample variance and population variance approach
the same number as the sample size increases, you should always use the
more conservative sample variance calculation unless you are absolutely
certain that your data constitutes the entire population for the columns.
Default Value: False
Types: bool
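As a rough local illustration of how the two arguments interact, the sketch below imitates `distinct` and `population` with Python's standard `statistics` module. `var_sketch` is a hypothetical helper, not the teradataml implementation. The readings are the buoy 0 rows visible in the `ocean_buoys` head() output in the EXAMPLES section.

```python
from statistics import pvariance, variance


def var_sketch(values, distinct=False, population=False):
    """Hypothetical local helper that only imitates the documented arguments."""
    data = [v for v in values if v is not None]  # nulls never take part
    if distinct:
        data = list(set(data))                   # keep each value only once
    if len(data) < 2:
        return None                              # too few points to report
    return pvariance(data) if population else variance(data)


temperature = [100.0, None, 99.0, 10.0, 10.0]  # buoy 0 temperatures
salinity = [55.0, 55.0, 55.0, 55.0, 55.0]      # buoy 0 salinity

print(var_sketch(temperature))                   # 2670.25
print(var_sketch(temperature, distinct=True))    # about 2670.333
print(var_sketch(temperature, population=True))  # 2002.6875
print(var_sketch(salinity))                      # 0.0
print(var_sketch(salinity, distinct=True))       # None
```

With `distinct=True` the constant salinity column collapses to a single value, so no variance can be reported. This is consistent with the None entries under var_salinity in the DISTINCT time series example.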
RETURNS:
teradataml DataFrame object with var() operation performed.
RAISES:
1. TDMLDF_AGGREGATE_FAILED - If var() operation fails to
generate the column-wise variance of the dataframe.
Possible error message:
Unable to perform 'var()' on the dataframe.
2. TDMLDF_AGGREGATE_COMBINED_ERR - If the var() operation
does not support any of the selected columns in the dataframe.
Possible error message:
No results. Below is/are the error message(s):
All selected columns [(col2 - PERIOD_TIME), (col3 -
BLOB)] is/are unsupported for 'var' operation.
EXAMPLES:
# Load the data to run the example.
>>> from teradataml.dataframe.dataframe import DataFrame
>>> from teradataml.data.load_example_data import load_example_data
>>> load_example_data("dataframe", ["employee_info", "sales"])
# Example 1 - Applying var() on table 'employee_info', whose 'marks' and
# 'dob' columns contain only NULL values; these are reported
# as None in the variance dataframe.
# Create teradataml dataframe.
>>> df1 = DataFrame("employee_info")
>>> print(df1)
             first_name  marks   dob  joined_date
employee_no
101               abcde   None  None     02/12/05
100                abcd   None  None         None
112                None   None  None     18/12/05
>>>
# Select only a subset of columns from the DataFrame.
>>> df3 = df1.select(["employee_no", "first_name", "dob", "marks"])
# Prints the unbiased sample variance of each column (with supported data types).
>>> df3.var()
   var_employee_no var_dob var_marks
0        44.333333    None      None
# Example 2 - Applying var() on table 'sales', which has different
# types of data such as floats, integers and strings,
# some of which contain NULL values that are ignored.
# Create teradataml dataframe.
>>> df1 = DataFrame("sales")
>>> print(df1)
              Feb   Jan   Mar   Apr    datetime
accounts
Blue Inc     90.0    50    95   101  04/01/2017
Orange Inc  210.0  None  None   250  04/01/2017
Red Inc     200.0   150   140  None  04/01/2017
Yellow Inc   90.0  None  None  None  04/01/2017
Jones LLC   200.0   150   140   180  04/01/2017
Alpha Co    210.0   200   215   250  04/01/2017
# Prints the unbiased sample variance of each column (with supported data types).
>>> df3 = df1.select(["accounts","Feb","Jan","Mar","Apr"])
>>> df3.var()
       var_Feb      var_Jan  var_Mar      var_Apr
0  3546.666667  3958.333333   2475.0  5036.916667
>>>
# Prints the population variance of each column (with supported data types).
>>> df3.var(population=True)
       var_Feb  var_Jan  var_Mar    var_Apr
0  2955.555556  2968.75  1856.25  3777.6875
>>>
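The sample and population results in Example 2 differ only in the divisor: the sample variance divides the sum of squared deviations by n - 1 and the population variance divides it by n, where n is the number of non-null values in the column. A small local sketch of that relationship, using the printed var_Feb and var_Apr values (`sample_to_population` is a hypothetical helper, not part of teradataml):

```python
def sample_to_population(sample_var, n):
    """Hypothetical helper: rescale a sample variance to a population variance."""
    # Both share the same sum of squared deviations; only the divisor differs.
    return sample_var * (n - 1) / n


# 'Feb' has six non-null values and 'Apr' has four in the 'sales' table.
feb_population = sample_to_population(3546.666667, 6)
apr_population = sample_to_population(5036.916667, 4)

print(round(feb_population, 4))  # 2955.5556
print(round(apr_population, 4))  # 3777.6875
```

The factor (n - 1) / n approaches 1 as n grows, which is why the two variances converge for large samples.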
#
# Using var() as Time Series Aggregate.
#
>>> # Load the example datasets.
... load_example_data("dataframe", ["ocean_buoys"])
>>>
#
# Time Series Aggregate Example 1: Executing var() function on DataFrame created on
# non-sequenced PTI table. We will consider all rows for the
# columns while calculating the variance value.
#
>>> # Create the required DataFrames.
... # DataFrame on non-sequenced PTI table
... ocean_buoys = DataFrame("ocean_buoys")
>>> # Check DataFrame columns and let's peek at the data
... ocean_buoys.columns
['buoyid', 'TD_TIMECODE', 'temperature', 'salinity']
>>> ocean_buoys.head()
                       TD_TIMECODE  temperature  salinity
buoyid
0       2014-01-06 08:10:00.000000        100.0        55
0       2014-01-06 08:08:59.999999          NaN        55
1       2014-01-06 09:01:25.122200         77.0        55
1       2014-01-06 09:03:25.122200         79.0        55
1       2014-01-06 09:01:25.122200         70.0        55
1       2014-01-06 09:02:25.122200         71.0        55
1       2014-01-06 09:03:25.122200         72.0        55
0       2014-01-06 08:09:59.999999         99.0        55
0       2014-01-06 08:00:00.000000         10.0        55
0       2014-01-06 08:10:00.000000         10.0        55
# To use var() as Time Series Aggregate we must run groupby_time() first, followed by var().
>>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
... value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_grpby1.var().sort(["TIMECODE_RANGE", "buoyid"])
                                      TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  var_salinity  var_temperature
0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0           0.0       2670.25000
1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1           0.0         15.50000
2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2           0.0          1.00000
3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44           0.0         33.24359
>>>
#
# Time Series Aggregate Example 2: Executing var() function on DataFrame created on
# non-sequenced PTI table. We will consider DISTINCT rows for the
# columns while calculating the variance value.
#
# To use var() as Time Series Aggregate we must run groupby_time() first, followed by var().
>>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
... value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_grpby1.var(distinct=True).sort(["TIMECODE_RANGE", "buoyid"])
                                      TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  var_salinity  var_temperature
0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0          None      2670.333333
1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1          None        15.500000
2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2          None         1.000000
3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44          None        27.700000
>>>
#
# Time Series Aggregate Example 3: Executing var() function on DataFrame created on
# non-sequenced PTI table. We shall calculate the
# variance on entire population, with all non-null
# data points considered for calculations.
#
# To use var() as Time Series Aggregate we must run groupby_time() first, followed by var().
# To calculate population variance we must set population=True.
#
>>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="2cy",
... value_expression="buoyid", fill="NULLS")
>>> ocean_buoys_grpby1.var(population=True).sort(["TIMECODE_RANGE", "buoyid"])
                                      TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  var_salinity  var_temperature
0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0           0.0      2002.687500
1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1           0.0        12.916667
2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2           0.0         0.666667
3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44           0.0        30.686391
>>>