The aggregate function ts.describe() returns the column statistics of the values in each group. By default, the function calculates maximum, minimum, average, sample standard deviation, median, mode, 25th percentile, 50th percentile and 75th percentile values of the column.
Unlike other aggregate functions, ts.describe() function generates multiple columns. The column names provided in summarise() call are not used as the resultant column name.
- Nulls are not included in the result computation.
- ts.describe() is valid only for numeric data.
- Unlike other statistical values, mode values are calculated without excluding duplicates even though distinct() is used.
- ts.describe() must be used with group_by_time().
- Mode computation can result in duplicate rows with different values.
- Percentile computations use a linear interpolation process to determine if the percentile lies between two data points.
Refer to ts.percentile() for additional interpolation options.
Arguments:
- value.expression: Specify the column for which the statistics are to be computed.
Use ts.describe(distinct(column_name)) to exclude duplicate rows while calculating statistics.
Example 1: Calculate the statistics of the 'temperature' column of sequenced PTI table
- Calculate the statistics.
> df_seq_describe <- df_seq_grp %>% summarise(describe_temp = ts.describe(temperature))
- Print the results.
> df_seq_describe %>% arrange(TIMECODE_RANGE, buoyid, `MODE(temperature)`) # Source: lazy query [?? x 12] # Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] # [TDAPUSER@<hostname>/TDAPUSERDB] # Ordered by: TIMECODE_RANGE, buoyid, `MODE(temperature)` TIMECODE_RANGE `GROUP BY TIME(~ buoyid `Maximum(temper~ `Minimum(temper~ `Average(temper~ <chr> <int64> <int> <int> <int> <dbl> 1 2014-01-06 08~ 35345 0 100 10 54.8 2 2014-01-06 09~ 35347 1 79 70 74.5 3 2014-01-06 09~ 35347 1 79 70 74.5 4 2014-01-06 09~ 35347 1 79 70 74.5 5 2014-01-06 09~ 35347 1 79 70 74.5 6 2014-01-06 09~ 35347 1 79 70 74.5 7 2014-01-06 09~ 35347 1 79 70 74.5 8 2014-01-06 10~ 35349 44 56 43 49 9 2014-01-06 10~ 35350 22 23 23 23 10 2014-01-06 10~ 35350 44 43 43 43 # ... with more rows, and 6 more variables: `STDDEV_SAMP(temperature)` <dbl>, # `MEDIAN(temperature)` <dbl>, `MODE(temperature)` <int>, `PERCENTILE(temperature, 25, # LINEAR)` <dbl>, `PERCENTILE(temperature, 50, LINEAR)` <dbl>, `PERCENTILE(temperature, # 75, LINEAR)` <dbl>
Example 2: Calculate the statistics of the 'temperature' column of non-PTI table
- Calculate the statistics.
> df_nonpti_describe <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(describe_temp = ts.describe(distinct(temperature)))
- Print the results.
> df_nonpti_describe %>% arrange(TIMECODE_RANGE, `Mode(temperature)`) # Source: lazy query [?? x 11] # Database: [Teradata 16.20.50.01] [Teradata Native Driver 17.0.0.2] # [TDAPUSER@<hostname>/TDAPUSERDB] # Ordered by: TIMECODE_RANGE, `Mode(temperature)` TIMECODE_RANGE `GROUP BY TIME(~ `Maximum(Distin~ `Minimum(Distin~ `Average(Distin~ <chr> <int64> <int> <int> <dbl> 1 2014-01-06 08~ 2314993 99 10 54.5 2 2014-01-06 08~ 2314993 99 10 54.5 3 2014-01-06 08~ 2314994 100 10 55 4 2014-01-06 08~ 2314994 100 10 55 5 2014-01-06 09~ 2314999 79 70 74.5 6 2014-01-06 09~ 2314999 79 70 74.5 7 2014-01-06 09~ 2314999 79 70 74.5 8 2014-01-06 09~ 2314999 79 70 74.5 9 2014-01-06 09~ 2314999 79 70 74.5 10 2014-01-06 09~ 2314999 79 70 74.5 # ... with more rows, and 6 more variables: `STDDEV_SAMP(Distinct(temperature))` <dbl>, # `MEDIAN(Distinct(temperature))` <dbl>, `MODE(temperature)` <int>, # `PERCENTILE(Distinct(temperature, 25, LINEAR))` <dbl>, `PERCENTILE(Distinct(temperature, # 50, LINEAR))` <dbl>, `PERCENTILE(Distinct(temperature, 75, LINEAR))` <dbl>