ts.describe() | Teradata R Package - 17.00 - ts.describe() - Teradata R Package

Teradata® R Package User Guide

prodname
Teradata R Package
vrm_release
17.00
created_date
November 2020
category
User Guide
featnum
B700-4005-090K

The aggregate function ts.describe() returns the column statistics of the values in each group. By default, the function calculates maximum, minimum, average, sample standard deviation, median, mode, 25th percentile, 50th percentile and 75th percentile values of the column.

Unlike other aggregate functions, ts.describe() function generates multiple columns. The column names provided in summarise() call are not used as the resultant column name.

  • Nulls are not included in the result computation.
  • ts.describe() is valid only for numeric data.
  • Unlike other statistical values, mode values are calculated without excluding duplicates even though distinct() is used.
  • ts.describe() must be used with group_by_time().
  • Mode computation can result in duplicate rows with different values.
  • Percentile computations use a linear interpolation process to determine if the percentile lies between two data points.

    Refer to ts.percentile() for additional interpolation options.

Arguments:
  • value.expression: Specify the column for which the statistics are to be computed.

Use ts.describe(distinct(column_name)) to exclude duplicate rows while calculating statistics.

Example 1: Calculate the statistics of the 'temperature' column of sequenced PTI table

  • Calculate the statistics.
    > df_seq_describe <- df_seq_grp %>% summarise(describe_temp = ts.describe(temperature))
  • Print the results.
    > df_seq_describe %>% arrange(TIMECODE_RANGE, buoyid, `MODE(temperature)`)
    # Source:     lazy query [?? x 12]
    # Database:   [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2]
    #   [TDAPUSER@<hostname>/TDAPUSERDB]
    # Ordered by: TIMECODE_RANGE, buoyid, `MODE(temperature)`
       TIMECODE_RANGE `GROUP BY TIME(~ buoyid `Maximum(temper~ `Minimum(temper~ `Average(temper~
       <chr>          <int64>           <int>            <int>            <int>            <dbl>
     1 2014-01-06 08~ 35345                 0              100               10             54.8
     2 2014-01-06 09~ 35347                 1               79               70             74.5
     3 2014-01-06 09~ 35347                 1               79               70             74.5
     4 2014-01-06 09~ 35347                 1               79               70             74.5
     5 2014-01-06 09~ 35347                 1               79               70             74.5
     6 2014-01-06 09~ 35347                 1               79               70             74.5
     7 2014-01-06 09~ 35347                 1               79               70             74.5
     8 2014-01-06 10~ 35349                44               56               43             49 
     9 2014-01-06 10~ 35350                22               23               23             23 
    10 2014-01-06 10~ 35350                44               43               43             43 
    # ... with more rows, and 6 more variables: `STDDEV_SAMP(temperature)` <dbl>,
    #   `MEDIAN(temperature)` <dbl>, `MODE(temperature)` <int>, `PERCENTILE(temperature, 25,
    #   LINEAR)` <dbl>, `PERCENTILE(temperature, 50, LINEAR)` <dbl>, `PERCENTILE(temperature,
    #   75, LINEAR)` <dbl>

Example 2: Calculate the statistics of the 'temperature' column of non-PTI table

  • Calculate the statistics.
    > df_nonpti_describe <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(describe_temp = ts.describe(distinct(temperature)))
  • Print the results.
    > df_nonpti_describe %>% arrange(TIMECODE_RANGE, `Mode(temperature)`)
    # Source:     lazy query [?? x 11]
    # Database:   [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2]
    #   [TDAPUSER@<hostname>/TDAPUSERDB]
    # Ordered by: TIMECODE_RANGE, `Mode(temperature)`
       TIMECODE_RANGE `GROUP BY TIME(~ `Maximum(Distin~ `Minimum(Distin~ `Average(Distin~
       <chr>          <int64>                     <int>            <int>            <dbl>
     1 2014-01-06 08~ 2314993                        99               10             54.5
     2 2014-01-06 08~ 2314993                        99               10             54.5
     3 2014-01-06 08~ 2314994                       100               10             55 
     4 2014-01-06 08~ 2314994                       100               10             55 
     5 2014-01-06 09~ 2314999                        79               70             74.5
     6 2014-01-06 09~ 2314999                        79               70             74.5
     7 2014-01-06 09~ 2314999                        79               70             74.5
     8 2014-01-06 09~ 2314999                        79               70             74.5
     9 2014-01-06 09~ 2314999                        79               70             74.5
    10 2014-01-06 09~ 2314999                        79               70             74.5
    # ... with more rows, and 6 more variables: `STDDEV_SAMP(Distinct(temperature))` <dbl>,
    #   `MEDIAN(Distinct(temperature))` <dbl>, `MODE(temperature)` <int>,
    #   `PERCENTILE(Distinct(temperature, 25, LINEAR))` <dbl>, `PERCENTILE(Distinct(temperature,
    #   50, LINEAR))` <dbl>, `PERCENTILE(Distinct(temperature, 75, LINEAR))` <dbl>