Teradata Package for Python Function Reference | 20.00 - describe - Teradata Package for Python - Look here for syntax, methods and examples for the functions included in the Teradata Package for Python.
Teradata® Package for Python Function Reference - 20.00
- Deployment
- VantageCloud
- VantageCore
- Edition
- Enterprise
- IntelliFlex
- VMware
- Product
- Teradata Package for Python
- Release Number
- 20.00.00.03
- Published
- December 2024
- ft:locale
- en-US
- ft:lastEdition
- 2024-12-19
- dita:id
- TeradataPython_FxRef_Enterprise_2000
- Product Category
- Teradata Vantage
- teradataml.dataframe.dataframe.DataFrame.describe = describe(self, percentiles=[0.25, 0.5, 0.75], include=None, verbose=False, distinct=False, statistics=None, columns=None)
- DESCRIPTION:
Generates statistics for numeric columns. This function can be used in two modes:
1. Regular Aggregate Mode.
It computes the count, mean, std, min, percentiles, and max for numeric columns.
Default statistics include:
"count", "mean", "std", "min", "percentile", "max"
2. Time Series Aggregate Mode.
It computes max, mean, min, std, median, mode, and percentiles for numeric columns.
Default statistics include:
'max', 'mean', 'min', 'std'
Note:
Regular Aggregate Mode: If describe() is used on the output of any DataFrame API or groupby(),
then describe() is used as regular aggregation.
Time Series Aggregate Mode: If describe() is used on the output of groupby_time(), then describe()
is a time series aggregate, where time series aggregates are used
to calculate the statistics.
PARAMETERS:
percentiles:
Optional Argument.
A list of values between 0 and 1. Applicable for both modes.
By default, percentiles are calculated for statistics for 'Regular Aggregate Mode', whereas
for 'Time Series Aggregate Mode', percentiles are calculated when verbose is set to True.
Default Values: [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
Types: float or List of floats
include:
Optional Argument.
Values can be either None or "all".
If the value is "all", then both numeric and non-numeric columns are included.
Computes count, mean, std, min, percentiles, and max for numeric columns.
Computes count and unique for non-numeric columns.
If the value is None, only numeric columns are used for collecting statistics.
Note:
Value 'all' is not applicable for 'Time Series Aggregate Mode'.
Default Values: None
Types: str
verbose:
Optional Argument.
Specifies a boolean value to be used for time series aggregation, stating whether to get
verbose output or not.
When this argument is set to 'True', function calculates median, mode, and percentile values
on top of its default statistics.
Note:
verbose as 'True' is not applicable for 'Regular Aggregate Mode'.
Default Values: False
Types: bool
distinct:
Optional Argument.
Specifies a boolean value to decide whether to consider duplicate rows in statistic
calculation or not. By default, duplicate values are considered for statistic calculation.
When this is set to True, only distinct rows are considered for statistic calculation.
Default Values: False
Types: bool
statistics:
Optional Argument.
Specifies the aggregate operation to be performed.
Computes count, mean, std, min, percentiles, and max for numeric columns.
Computes count and unique for non-numeric columns.
Notes:
1. statistics is not applicable for 'Time Series Aggregate Mode'.
2. statistics should not be used with include as 'all'.
Permitted Values: count, mean, min, max, unique, std, describe, percentile
Default Values: None
Types: str or List of str
columns:
Optional Argument.
Specifies the name(s) of the columns we are collecting statistics for.
Default Values: None
Types: str or List of str
RETURNS:
teradataml DataFrame
RAISE:
TeradataMlException
EXAMPLES:
>>> load_example_data("dataframe","sales")
>>> df = DataFrame('sales')
>>> print(df)
Feb Jan Mar Apr datetime
accounts
Blue Inc 90.0 50 95 101 04/01/2017
Alpha Co 210.0 200 215 250 04/01/2017
Jones LLC 200.0 150 140 180 04/01/2017
Yellow Inc 90.0 None None None 04/01/2017
Red Inc 200.0 150 140 None 04/01/2017
Orange Inc 210.0 None None 250 04/01/2017
# Computes count, mean, std, min, percentiles, and max for numeric columns.
>>> df.describe()
Apr Feb Mar Jan
func
count 4 6 4 4
mean 195.25 166.667 147.5 137.5
std 70.971 59.554 49.749 62.915
min 101 90 95 50
25% 160.25 117.5 128.75 125
50% 215 200 140 150
75% 250 207.5 158.75 162.5
max 250 210 215 200
# Computes count, mean, std, min, percentiles, and max for numeric columns with 30th and 60th percentiles.
>>> df.describe(percentiles=[.3, .6])
Apr Feb Mar Jan
func
count 4 6 4 4
mean 195.25 166.667 147.5 137.5
std 70.971 59.554 49.749 62.915
min 101 90 95 50
30% 172.1 145 135.5 140
60% 236 200 140 150
max 250 210 215 200
# Computes count, mean, std, min, percentiles, and max for numeric columns group by "datetime" and "Feb".
>>> df1 = df.groupby(["datetime", "Feb"])
>>> df1.describe()
Jan Mar Apr
datetime Feb func
04/01/2017 90.0 25% 50 95 101
50% 50 95 101
75% 50 95 101
count 1 1 1
max 50 95 101
mean 50 95 101
min 50 95 101
std None None None
200.0 25% 150 140 180
50% 150 140 180
75% 150 140 180
count 2 2 1
max 150 140 180
mean 150 140 180
min 150 140 180
std 0 0 None
210.0 25% 200 215 250
50% 200 215 250
75% 200 215 250
count 1 1 2
max 200 215 250
mean 200 215 250
min 200 215 250
std None None 0
# Computes count, mean, std, min, percentiles, and max for numeric columns and
# computes count and unique for non-numeric columns
>>> df.describe(include="all")
accounts Feb Jan Mar Apr datetime
func
25% None 117.5 125 128.75 160.25 None
75% None 207.5 162.5 158.75 250 None
count 6 6 4 4 4 6
mean None 166.667 137.5 147.5 195.25 None
max None 210 200 215 250 None
min None 90 50 95 101 None
50% None 200 150 140 215 None
std None 59.554 62.915 49.749 70.971 None
unique 6 None None None None 1
#
# Examples for describe() function as Time Series Aggregate.
#
>>> # Load the example datasets.
... load_example_data("dataframe", ["ocean_buoys"])
>>>
>>> # Create the required DataFrames.
... # DataFrame on non-sequenced PTI table
... ocean_buoys = DataFrame("ocean_buoys")
>>> # Check DataFrame columns and let's peek at the data
... ocean_buoys.columns
['buoyid', 'TD_TIMECODE', 'temperature', 'salinity']
>>> ocean_buoys.head()
TD_TIMECODE temperature salinity
buoyid
0 2014-01-06 08:10:00.000000 100.0 55
0 2014-01-06 08:08:59.999999 NaN 55
1 2014-01-06 09:01:25.122200 77.0 55
1 2014-01-06 09:03:25.122200 79.0 55
1 2014-01-06 09:01:25.122200 70.0 55
1 2014-01-06 09:02:25.122200 71.0 55
1 2014-01-06 09:03:25.122200 72.0 55
0 2014-01-06 08:09:59.999999 99.0 55
0 2014-01-06 08:00:00.000000 10.0 55
0 2014-01-06 08:10:00.000000 10.0 55
>>> ocean_buoys_grpby = ocean_buoys.groupby_time(timebucket_duration="2cy", value_expression="buoyid", fill="NULLS")
>>>
#
# Example 1: Get the basic statistics for time series aggregation for all the numeric columns.
# This returns max, mean, min and std values.
#
>>> ocean_buoys_grpby.describe()
temperature salinity
TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 max 100 55
mean 54.75 55
min 10 55
std 51.674 0
1 max 79 55
mean 74.5 55
min 70 55
std 3.937 0
2 max 82 55
mean 81 55
min 80 55
std 1 0
44 max 56 55
mean 48.077 55
min 43 55
std 5.766 0
>>>
#
# Example 2: Get the verbose statistics for time series aggregation for all the numeric columns.
# This returns max, mean, min, std, median, mode, 25th, 50th and 75th percentile.
#
>>> ocean_buoys_grpby.describe(verbose=True)
temperature salinity
TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 25% 10 55
50% 54.5 55
75% 99.25 55
max 100 55
mean 54.75 55
median 54.5 55
min 10 55
mode 10 55
std 51.674 0
1 25% 71.25 55
50% 74.5 55
75% 77.75 55
max 79 55
mean 74.5 55
median 74.5 55
min 70 55
mode 71 55
mode 72 55
mode 77 55
mode 78 55
mode 79 55
mode 70 55
std 3.937 0
2 25% 80.5 55
50% 81 55
75% 81.5 55
max 82 55
mean 81 55
median 81 55
min 80 55
mode 80 55
mode 81 55
mode 82 55
std 1 0
44 25% 43 55
50% 43 55
75% 53 55
max 56 55
mean 48.077 55
median 43 55
min 43 55
mode 43 55
std 5.766 0
>>>
#
# Example 3: Get the basic statistics for time series aggregation for all the numeric columns,
# consider only unique values.
# This returns max, mean, min and std values.
#
>>> ocean_buoys_grpby.describe(distinct=True)
temperature salinity
TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 max 100 55
mean 69.667 55
min 10 55
std 51.675 None
1 max 79 55
mean 74.5 55
min 70 55
std 3.937 None
2 max 82 55
mean 81 55
min 80 55
std 1 None
44 max 56 55
mean 52.2 55
min 43 55
std 5.263 None
>>>
#
# Example 4: Get the verbose statistics for time series aggregation for all the numeric columns.
# This select non-default percentiles 33rd and 66th.
# This returns max, mean, min, std, median, mode, 33rd, and 66th percentile.
#
>>> ocean_buoys_grpby.describe(verbose=True, percentiles=[0.33, 0.66])
temperature salinity
TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 33% 10 55
66% 97.22 55
max 100 55
mean 54.75 55
median 54.5 55
min 10 55
mode 10 55
std 51.674 0
1 33% 71.65 55
66% 77.3 55
max 79 55
mean 74.5 55
median 74.5 55
min 70 55
mode 70 55
mode 71 55
mode 77 55
mode 78 55
mode 79 55
mode 72 55
std 3.937 0
2 33% 80.66 55
66% 81.32 55
max 82 55
mean 81 55
median 81 55
min 80 55
mode 80 55
mode 81 55
mode 82 55
std 1 0
44 33% 43 55
66% 53 55
max 56 55
mean 48.077 55
median 43 55
min 43 55
mode 43 55
std 5.766 0
>>>