Teradata Python Package Function Reference - 16.20 - describe - Teradata Python Package

Teradata® Python Package Function Reference

prodname
Teradata Python Package
vrm_release
16.20
created_date
February 2020
category
Programming Reference
featnum
B700-4008-098K
teradataml.dataframe.dataframe.DataFrame.describe = describe(self, percentiles=[0.25, 0.5, 0.75], include=None, verbose=False, distinct=False)
DESCRIPTION:
    Generates statistics for numeric columns. This function can be used in two modes:
        1. Regular Aggregate Mode.
            It computes the count, mean, std, min, percentiles, and max for numeric columns.
            Default statistics include:
                "count", "mean", "std", "min", "percentile", "max"
        2. Time Series Aggregate Mode.
            It computes max, mean, min, std, median, mode, and percentiles for numeric columns.
            Default statistics include:
                'max', 'mean', 'min', 'std'
 
    Note:
        Regular Aggregate Mode: If describe() is used on the output of any DataFrame API or groupby(),
                                then describe() is used as regular aggregation.
        Time Series Aggregate Mode: If describe() is used on the output of groupby_time(), then describe()
                                    is a time series aggregate, where time series aggregates are used
                                    to calculate the statistics.
 
PARAMETERS:
    percentiles:
        Optional Argument.
        A list of values between 0 and 1. Applicable for both modes.
        By default, percentiles are calculated for statistics for 'Regular Aggregate Mode', whereas
        for 'Time Series Aggregate Mode', percentiles are calculated when verbose is set to True.
        Default Values: [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
        Types: float or List of floats
 
    include:
        Optional Argument.
        Values can be either None or "all".
        If the value is "all", then both numeric and non-numeric columns are included.
        Computes count, mean, std, min, percentiles, and max for numeric columns.
        Computes count and unique for non-numeric columns.
        If the value is None, only numeric columns are used for collecting statistics.
        Note:
            Value 'all' is not applicable for 'Time Series Aggregate Mode'.
        Default Values: None
        Types: str
 
    verbose:
        Optional Argument.
        Specifies a boolean value to be used for time series aggregation, stating whether to get
        verbose output or not.
        When this argument is set to 'True', function calculates median, mode, and percentile values
        on top of its default statistics.
        Note:
            verbose as 'True' is not applicable for 'Regular Aggregate Mode'.
        Default Values: False
        Types: bool
 
    distinct:
        Optional Argument.
        Specifies a boolean value to decide whether to consider duplicate rows in statistic
        calculation or not. By default, duplicate values are considered for statistic calculation.
        When this is set to True, only distinct rows are considered for statistic calculation.
        Default Values: False
        Types: bool
 
RETURNS:
    teradataml DataFrame
 
RAISE:
    TeradataMlException
 
EXAMPLES:
    >>> load_example_data("dataframe","sales")
    >>> df = DataFrame('sales')
    >>> print(df)
                  Feb   Jan   Mar   Apr    datetime
    accounts
    Blue Inc     90.0    50    95   101  04/01/2017
    Alpha Co    210.0   200   215   250  04/01/2017
    Jones LLC   200.0   150   140   180  04/01/2017
    Yellow Inc   90.0  None  None  None  04/01/2017
    Red Inc     200.0   150   140  None  04/01/2017
    Orange Inc  210.0  None  None   250  04/01/2017
 
    # Computes count, mean, std, min, percentiles, and max for numeric columns.
    >>> df.describe()
              Apr      Feb     Mar     Jan
    func
    count       4        6       4       4
    mean   195.25  166.667   147.5   137.5
    std    70.971   59.554  49.749  62.915
    min       101       90      95      50
    25%    160.25    117.5  128.75     125
    50%       215      200     140     150
    75%       250    207.5  158.75   162.5
    max       250      210     215     200
 
    # Computes count, mean, std, min, percentiles, and max for numeric columns with 30th and 60th percentiles.
    >>> df.describe(percentiles=[.3, .6])
              Apr      Feb     Mar     Jan
    func
    count       4        6       4       4
    mean   195.25  166.667   147.5   137.5
    std    70.971   59.554  49.749  62.915
    min       101       90      95      50
    30%     172.1      145   135.5     140
    60%       236      200     140     150
    max       250      210     215     200
 
    # Computes count, mean, std, min, percentiles, and max for numeric columns group by "datetime" and "Feb".
    >>> df1 = df.groupby(["datetime", "Feb"])
    >>> df1.describe()
                             Jan   Mar   Apr
    datetime   Feb   func
    04/01/2017 90.0  25%      50    95   101
                     50%      50    95   101
                     75%      50    95   101
                     count     1     1     1
                     max      50    95   101
                     mean     50    95   101
                     min      50    95   101
                     std    None  None  None
               200.0 25%     150   140   180
                     50%     150   140   180
                     75%     150   140   180
                     count     2     2     1
                     max     150   140   180
                     mean    150   140   180
                     min     150   140   180
                     std       0     0  None
               210.0 25%     200   215   250
                     50%     200   215   250
                     75%     200   215   250
                     count     1     1     2
                     max     200   215   250
                     mean    200   215   250
                     min     200   215   250
                     std    None  None     0
 
    # Computes count, mean, std, min, percentiles, and max for numeric columns and
    # computes count and unique for non-numeric columns
    >>> df.describe(include="all")
           accounts      Feb     Jan     Mar     Apr datetime
    func
    25%        None    117.5     125  128.75  160.25     None
    75%        None    207.5   162.5  158.75     250     None
    count         6        6       4       4       4        6
    mean       None  166.667   137.5   147.5  195.25     None
    max        None      210     200     215     250     None
    min        None       90      50      95     101     None
    50%        None      200     150     140     215     None
    std        None   59.554  62.915  49.749  70.971     None
    unique        6     None    None    None    None        1
 
    #
    # Examples for describe() function as Time Series Aggregate.
    #
    >>> # Load the example datasets
    ... load_example_data("dataframe", ["ocean_buoys"])
    >>>
 
    >>> # Create the required DataFrames.
    ... # DataFrame on non-sequenced PTI table
    ... ocean_buoys = DataFrame("ocean_buoys")
    >>> # Check DataFrame columns and let's peek at the data
    ... ocean_buoys.columns
    ['buoyid', 'TD_TIMECODE', 'temperature', 'salinity']
    >>> ocean_buoys.head()
                           TD_TIMECODE  temperature  salinity
    buoyid
    0       2014-01-06 08:10:00.000000        100.0        55
    0       2014-01-06 08:08:59.999999          NaN        55
    1       2014-01-06 09:01:25.122200         77.0        55
    1       2014-01-06 09:03:25.122200         79.0        55
    1       2014-01-06 09:01:25.122200         70.0        55
    1       2014-01-06 09:02:25.122200         71.0        55
    1       2014-01-06 09:03:25.122200         72.0        55
    0       2014-01-06 08:09:59.999999         99.0        55
    0       2014-01-06 08:00:00.000000         10.0        55
    0       2014-01-06 08:10:00.000000         10.0        55
 
    >>> ocean_buoys_grpby = ocean_buoys.groupby_time(timebucket_duration="2cy", value_expression="buoyid", fill="NULLS")
    >>>
 
    #
    # Example 1: Get the basic statistics for time series aggregation for all the numeric columns.
    #            This returns max, mean, min and std values.
    #
    >>> ocean_buoys_grpby.describe()
                                                                                               temperature salinity
    TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
    ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      max          100       55
                                                                                          mean       54.75       55
                                                                                          min           10       55
                                                                                          std       51.674        0
                                                                                   1      max           79       55
                                                                                          mean        74.5       55
                                                                                          min           70       55
                                                                                          std        3.937        0
                                                                                   2      max           82       55
                                                                                          mean          81       55
                                                                                          min           80       55
                                                                                          std            1        0
                                                                                   44     max           56       55
                                                                                          mean      48.077       55
                                                                                          min           43       55
                                                                                          std        5.766        0
    >>>
 
    #
    # Example 2: Get the verbose statistics for time series aggregation for all the numeric columns.
    #            This returns max, mean, min, std, median, mode, 25th, 50th and 75th percentile.
    #
    >>> ocean_buoys_grpby.describe(verbose=True)
                                                                                                 temperature salinity
    TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
    ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      25%             10       55
                                                                                          50%           54.5       55
                                                                                          75%          99.25       55
                                                                                          max            100       55
                                                                                          mean         54.75       55
                                                                                          median        54.5       55
                                                                                          min             10       55
                                                                                          mode            10       55
                                                                                          std         51.674        0
                                                                                   1      25%          71.25       55
                                                                                          50%           74.5       55
                                                                                          75%          77.75       55
                                                                                          max             79       55
                                                                                          mean          74.5       55
                                                                                          median        74.5       55
                                                                                          min             70       55
                                                                                          mode            71       55
                                                                                          mode            72       55
                                                                                          mode            77       55
                                                                                          mode            78       55
                                                                                          mode            79       55
                                                                                          mode            70       55
                                                                                          std          3.937        0
                                                                                   2      25%           80.5       55
                                                                                          50%             81       55
                                                                                          75%           81.5       55
                                                                                          max             82       55
                                                                                          mean            81       55
                                                                                          median          81       55
                                                                                          min             80       55
                                                                                          mode            80       55
                                                                                          mode            81       55
                                                                                          mode            82       55
                                                                                          std              1        0
                                                                                   44     25%             43       55
                                                                                          50%             43       55
                                                                                          75%             53       55
                                                                                          max             56       55
                                                                                          mean        48.077       55
                                                                                          median          43       55
                                                                                          min             43       55
                                                                                          mode            43       55
                                                                                          std          5.766        0
    >>>
 
    #
    # Example 3: Get the basic statistics for time series aggregation for all the numeric columns,
    #            consider only unique values.
    #            This returns max, mean, min and std values.
    #
    >>> ocean_buoys_grpby.describe(distinct=True)
                                                                                               temperature salinity
    TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
    ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      max          100       55
                                                                                          mean      69.667       55
                                                                                          min           10       55
                                                                                          std       51.675     None
                                                                                   1      max           79       55
                                                                                          mean        74.5       55
                                                                                          min           70       55
                                                                                          std        3.937     None
                                                                                   2      max           82       55
                                                                                          mean          81       55
                                                                                          min           80       55
                                                                                          std            1     None
                                                                                   44     max           56       55
                                                                                          mean        52.2       55
                                                                                          min           43       55
                                                                                          std        5.263     None
    >>>
 
    #
    # Example 4: Get the verbose statistics for time series aggregation for all the numeric columns.
    #            This select non-default percentiles 33rd and 66th.
    #            This returns max, mean, min, std, median, mode, 33rd, and 66th percentile.
    #
    >>> ocean_buoys_grpby.describe(verbose=True, percentiles=[0.33, 0.66])
                                                                                                 temperature salinity
    TIMECODE_RANGE                                     GROUP BY TIME(CAL_YEARS(2)) buoyid func
    ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2                           0      33%             10       55
                                                                                          66%          97.22       55
                                                                                          max            100       55
                                                                                          mean         54.75       55
                                                                                          median        54.5       55
                                                                                          min             10       55
                                                                                          mode            10       55
                                                                                          std         51.674        0
                                                                                   1      33%          71.65       55
                                                                                          66%           77.3       55
                                                                                          max             79       55
                                                                                          mean          74.5       55
                                                                                          median        74.5       55
                                                                                          min             70       55
                                                                                          mode            70       55
                                                                                          mode            71       55
                                                                                          mode            77       55
                                                                                          mode            78       55
                                                                                          mode            79       55
                                                                                          mode            72       55
                                                                                          std          3.937        0
                                                                                   2      33%          80.66       55
                                                                                          66%          81.32       55
                                                                                          max             82       55
                                                                                          mean            81       55
                                                                                          median          81       55
                                                                                          min             80       55
                                                                                          mode            80       55
                                                                                          mode            81       55
                                                                                          mode            82       55
                                                                                          std              1        0
                                                                                   44     33%             43       55
                                                                                          66%             53       55
                                                                                          max             56       55
                                                                                          mean        48.077       55
                                                                                          median          43       55
                                                                                          min             43       55
                                                                                          mode            43       55
                                                                                          std          5.766        0
    >>>