describe() | Teradata Package for Python - 17.00 - describe() Method - Teradata Package for Python

Teradata® Package for Python User Guide

Product
Teradata Package for Python
Release Number
17.00
Release Date
November 2021
Content Type
User Guide
Publication ID
B700-4006-070K
Language
English (United States)
The describe() function generates statistics for numeric columns. This function can be used in two modes:
  • Regular Aggregate Mode

    It computes the count, mean, std, min, percentiles, and max for numeric columns.

    Default statistics include: "count", "mean", "std", "min", "percentile", "max".

    If describe() is used on the output of any DataFrame API or groupby(), then it is used in regular aggregate mode.
  • Time Series Aggregate Mode

    It computes max, mean, min, std, median, mode, and percentiles for numeric columns.

    Default statistics include: 'max', 'mean', 'min', 'std'

    If describe() is used on the output of groupby_time(), then it is used in time series aggregate mode, where time series aggregates are used to calculate the statistics.
Optional arguments:
  • percentiles: A list of values between 0 and 1 used for computing percentiles.

    The default value is [.25, .5, .75], which generates the 25th, 50th, and 75th percentiles.

  • include: The values for this argument can be either 'None' or 'all', used to specify if non-numeric columns are included in the computation.
    • If the value is 'all': Both numeric and non-numeric columns are included. The function computes count, mean, std, min, percentiles, and max for numeric columns, and computes count and unique for non-numeric columns.
    • If the value is 'None': Only numeric columns are used for collecting statics.
    The default value is 'None'.
    Value 'all' is not applicable for Time Series Aggregate Mode.
  • verbose: Specifies a boolean value to be used for time series aggregation, stating whether to get verbose output or not. When this argument is set to 'True', function calculates median, mode, and percentile values on top of its default statistics.

    Default and the only acceptable value for this argument when used in Regular Aggregate Mode is 'False'.

    verbose as 'True' is not applicable for Regular Aggregate Mode.

  • distinct: Specifies a boolean value to decide whether to consider duplicate rows in statistic calculation or not.

    By default, this argument is set to 'False', which means that duplicate values are considered for statistic calculation.

    When this is set to 'True', only distinct rows are considered for statistic calculation.

1. Examples for describe() as regular function or aggregate function

Example Prerequisite

>>> df = DataFrame('sales')
>>> df
              Feb   Jan   Mar   Apr    datetime
accounts                                      
Alpha Co    210.0   200   215   250  04/01/2017
Red Inc     200.0   150   140  None  04/01/2017
Orange Inc  210.0  None  None   250  04/01/2017
Jones LLC   200.0   150   140   180  04/01/2017
Yellow Inc   90.0  None  None  None  04/01/2017
Blue Inc     90.0    50    95   101  04/01/2017

Example 1.1: Generates statistics for DataFrame "sales"

Use default values to computes count, mean, std, min, percentiles, and max for numeric columns.

>>> df.describe()
          Apr      Feb     Mar     Jan
func
count       4        6       4       4
mean   195.25  166.667   147.5   137.5
std    70.971   59.554  49.749  62.915
min       101       90      95      50
25%    160.25    117.5  128.75     125
50%       215      200     140     150
75%       250    207.5  158.75   162.5
max       250      210     215     200

Example 1.2: Use argument percentiles to compute the 30th and 60th percentiles

>>> df.describe(percentiles=[.3, .6])
          Apr      Feb     Mar     Jan
func
count       4        6       4       4
mean   195.25  166.667   147.5   137.5
std    70.971   59.554  49.749  62.915
min       101       90      95      50
30%     172.1      145   135.5     140
60%       236      200     140     150
max       250      210     215     200

Example 1.3: Use groupby to compute statistics for specific groups

>>> df1 = df.groupby(["datetime", "Feb"])
>>> df1.describe()
                         Jan   Mar   Apr
datetime   Feb   func                  
04/01/2017 90.0  25%      50    95   101
                 50%      50    95   101
                 75%      50    95   101
                 count     1     1     1
                 max      50    95   101
                 mean     50    95   101
                 min      50    95   101
                 std    None  None  None
           200.0 25%     150   140   180
                 50%     150   140   180
                 75%     150   140   180
                 count     2     2     1
                 max     150   140   180
                 mean    150   140   180
                 min     150   140   180
                 std       0     0  None
           210.0 25%     200   215   250
                 50%     200   215   250
                 75%     200   215   250
                 count     1     1     2
                 max     200   215   250
                 mean    200   215   250
                 min     200   215   250
                 std    None  None     0

Example 1.4: Use argument include value 'all' to compute statistics for all columns

Computes count, mean, std, min, percentiles, and max for numeric columns and computes count and unique for non-numeric columns.

>>> df.describe(include="all")
       accounts      Feb     Jan     Mar     Apr datetime
func                                                    
25%        None    117.5     125  128.75  160.25     None
75%        None    207.5   162.5  158.75     250     None
count         6        6       4       4       4        6
mean       None  166.667   137.5   147.5  195.25     None
max        None      210     200     215     250     None
min        None       90      50      95     101     None
50%        None      200     150     140     215     None
std        None   59.554  62.915  49.749  70.971     None
unique        6     None    None    None    None        1

2. Examples for describe() as Time Series Aggregate function

Examples Prerequisite

See Example Setup to set up the environment for the following examples.

Example 2.1: Get the basic statistics

Get the basic statistics for time series aggregation for all the numeric columns, use default settings. This example returns max, mean, min and std values.

>>> ocean_buoys_grpby.describe()
                                                                                           temperature salinity
TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      max          100       55
                                                                                      mean       54.75       55
                                                                                      min           10       55
                                                                                      std       51.674        0
                                                                               1      max           79       55
                                                                                      mean        74.5       55
                                                                                      min           70       55
                                                                                      std        3.937        0
                                                                               2      max           82       55
                                                                                      mean          81       55
                                                                                      min           80       55
                                                                                      std            1        0
                                                                               44     max           56       55
                                                                                      mean      48.077       55
                                                                                      min           43       55
                                                                                      std        5.766        0

Example 2.2: Get the verbose statistics

Get the verbose statistics for time series aggregation for all the numeric columns, use default settings. This example returns max, mean, min, std, median, mode, 25th, 50th and 75th percentile.

>>> ocean_buoys_grpby.describe(verbose=True)
                                                                                             temperature salinity
TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      25%             10       55
                                                                                      50%           54.5       55
                                                                                      75%          99.25       55
                                                                                      max            100       55
                                                                                      mean         54.75       55
                                                                                      median        54.5       55
                                                                                      min             10       55
                                                                                      mode            10       55
                                                                                      std         51.674        0
                                                                               1      25%          71.25       55
                                                                                      50%           74.5       55
                                                                                      75%          77.75       55
                                                                                      max             79       55
                                                                                      mean          74.5       55
                                                                                      median        74.5       55
                                                                                      min             70       55
                                                                                      mode            71       55
                                                                                      mode            72       55
                                                                                      mode            77       55
                                                                                      mode            78       55
                                                                                      mode            79       55
                                                                                      mode            70       55
                                                                                      std          3.937        0
                                                                               2      25%           80.5       55
                                                                                      50%             81       55
                                                                                      75%           81.5       55
                                                                                      max             82       55
                                                                                      mean            81       55
                                                                                      median          81       55
                                                                                      min             80       55
                                                                                      mode            80       55
                                                                                      mode            81       55
                                                                                      mode            82       55
                                                                                      std              1        0
                                                                               44     25%             43       55
                                                                                      50%             43       55
                                                                                      75%             53       55
                                                                                      max             56       55
                                                                                      mean        48.077       55
                                                                                      median          43       55
                                                                                      min             43       55
                                                                                      mode            43       55
                                                                                      std          5.766        0

Example 2.3: Get the basic statistics, consider only unique values

Get the basic statistics for time series aggregation for all the numeric columns, consider only unique values. This example returns max, mean, min and std values.

>>> ocean_buoys_grpby.describe(distinct=True)
                                                                                           temperature salinity
TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      max          100       55
                                                                                      mean      69.667       55
                                                                                      min           10       55
                                                                                      std       51.675     None
                                                                               1      max           79       55
                                                                                      mean        74.5       55
                                                                                      min           70       55
                                                                                      std        3.937     None
                                                                               2      max           82       55
                                                                                      mean          81       55
                                                                                      min           80       55
                                                                                      std            1     None
                                                                               44     max           56       55
                                                                                      mean        52.2       55
                                                                                      min           43       55
                                                                                      std        5.263     None

Example 2.4: Get the verbose statistics, select non-default percentiles

Get the verbose statistics for time series aggregation for all the numeric columns. In this example, you select non-default percentiles 33rd and 66th. This example returns max, mean, min, std, median, mode, 33rd, and 66th percentile.

>>> ocean_buoys_grpby.describe(verbose=True, percentiles=[0.33, 0.66])
                                                                                             temperature salinity
TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      33%             10       55
                                                                                      66%          97.22       55
                                                                                      max            100       55
                                                                                      mean         54.75       55
                                                                                      median        54.5       55
                                                                                      min             10       55
                                                                                      mode            10       55
                                                                                      std         51.674        0
                                                                               1      33%          71.65       55
                                                                                      66%           77.3       55
                                                                                      max             79       55
                                                                                      mean          74.5       55
                                                                                      median        74.5       55
                                                                                      min             70       55
                                                                                      mode            70       55
                                                                                      mode            71       55
                                                                                      mode            77       55
                                                                                      mode            78       55
                                                                                      mode            79       55
                                                                                      mode            72       55
                                                                                      std          3.937        0
                                                                               2      33%          80.66       55
                                                                                      66%          81.32       55
                                                                                      max             82       55
                                                                                      mean            81       55
                                                                                      median          81       55
                                                                                      min             80       55
                                                                                      mode            80       55
                                                                                      mode            81       55
                                                                                      mode            82       55
                                                                                      std              1        0
                                                                               44     33%             43       55
                                                                                      66%             53       55
                                                                                      max             56       55
                                                                                      mean        48.077       55
                                                                                      median          43       55
                                                                                      min             43       55
                                                                                      mode            43       55
                                                                                      std          5.766        0