Teradata R Package Function Reference | 17.00 - 17.00 - Regular Aggregates - Teradata R Package

Teradata® R Package Function Reference

prodname
Teradata R Package
vrm_release
17.00
created_date
September 2020
category
Programming Reference
featnum
B700-4007-090K

Introduction to Regular Aggregates

The regular aggregate functions help perform aggregate operations on non-time series data. The goal of this vignette is to provide introductory examples to get the user familiar with regular aggregate functions.

Each aggregate function has a section in this vignette. Each section can be read in isolation, except where references are made to other sections or other vignettes. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using the aggregate functions with Teradata Vantage.

Once the connection is established using td_create_context, you can load the example table ocean_buoys_seq and create respective tibble using the following statements:

loadExampleData("time_series_example", "ocean_buoys_seq")
# Create tbl_teradata objects.
df_seq <- tbl(con, "ocean_buoys_seq")

Median : median()

The aggregate function median returns the median of all values in each group. The function returns the average of the two middle values if the argument value.expression contains an even number of values. median uses one argument named value.expression, which specifies the column for which median is to be computed.

# Calculate the median value of the 'temperature' column grouped by 'buoyid' column.
df_median <- df_seq %>% group_by(buoyid) %>% summarise(median_temp = median(temperature))

# Print the results.
df_median %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid median_temp
#>    <int>       <int>
#> 1      0          54
#> 2      1          74
#> 3      2          81
#> 4     22          23
#> 5     44          43

# Calculate the median value of the 'temperature' column without any grouping.
df_median <- df_seq %>% summarise(median_temp = median(temperature))

# Print the results.
df_median
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   median_temp
#>         <int>
#> 1          54

Notes

  • The column containing median values is of NUMERIC data type.
  • NULLs/NAs are not included in the result computation.
  • median is valid only for numeric data.
  • Use of distinct to exclude duplicates is not allowed for median function. This will throw an error as shown in the example 1 below. However, duplicate values can be excluded in the computation for time series aggregate function ts.median. For more information on time series aggregation function ts.median, please refer to Time Series Aggregates vignettes.
  • Teradata recommends to strictly use the function median(value.expression) for regular aggregate operation with/without grouping.
  • The function median cannot be used with dplyr verbs like mutate, filter as shown in the example 2 below.
# Example 1

# Using 'distinct()' to calculate median on distinct temperature values.
df_median <- df_seq %>% group_by(buoyid) %>% summarise(median_temp = median(distinct(temperature)))

# Print the results.
# This throws an error "Illegal use of DISTINCT Aggregate Expressions."
df_median

# Example 2

# Incorrect usage of median() with mutate and filter.
df_median <- df_seq %>% mutate(median_t = median(temperature))
df_median <- df_seq %>% filter(temperature <= median(temperature))

Kurtosis : kurtosis()

The aggregate function kurtosis measures the tailedness of the probability distribution of a column. Kurtosis is the fourth moment of the distribution of the standardized (z) values. It is a measure of the outlier (rare, extreme observation) character of the distribution as compared to the normal (Gaussian) distribution.

  • The normal distribution has a kurtosis of 0.
  • Positive kurtosis indicates that the distribution is more outlier-prone (deviation from the mean) than the normal distribution.
  • Negative kurtosis indicates that the distribution is less outlier-prone (deviation from the mean) than the normal distribution.

kurtosis uses one argument named value.expression, which specifies the column for which kurtosis is to be computed. Use kurtosis(distinct(column_name)) to exclude duplicate rows while calculating kurtosis.

# Calculate the kurtosis value of the 'temperature' column grouped by 'buoyid' column.
df_kurtosis <- df_seq %>% group_by(buoyid) %>% summarise(kurtosis_temp = kurtosis(temperature))

# Print the results.
df_kurtosis %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid kurtosis_temp
#>    <int>         <dbl>
#> 1      0         -6.00
#> 2      1         -2.76
#> 3      2         NA   
#> 4     22         NA   
#> 5     44         -2.20

# Exclude duplicates in the same aggregate operation above.
df_kurtosis <- df_seq %>% group_by(buoyid) %>% summarise(kurtosis_temp = kurtosis(distinct(temperature)))

# Print the results.
df_kurtosis %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid kurtosis_temp
#>    <int>         <dbl>
#> 1      0         NA   
#> 2      1         -2.76
#> 3      2         NA   
#> 4     22         NA   
#> 5     44          4.13

# Calculate the kurtosis value of the 'temperature' column without any grouping.
df_kurtosis <- df_seq %>% summarise(kurtosis_temp = kurtosis(temperature))

# Print the results.
df_kurtosis
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   kurtosis_temp
#>           <dbl>
#> 1        -0.198

# Exclude duplicates in the same aggregate operation above.
df_kurtosis <- df_seq %>% summarise(kurtosis_temp = kurtosis(distinct(temperature)))

# Print the results.
df_kurtosis
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   kurtosis_temp
#>           <dbl>
#> 1         0.742

Notes

  • NULLs/NAs are not included in the result computation.
  • The time series aggregate function ts.kurtosis can also be used with regular aggregate operation i.e., the functions ts.kurtosis and kurtosis can be used interchangeably. For more information on time series aggregation function ts.kurtosis, please refer to Time Series Aggregates vignette. However, Teradata recommends to strictly use kurtosis for regular aggregate operation.
  • Following conditions will produce NULL result:
    • Fewer than three non-NULL data points in the data used for the computation.
    • Standard deviation for a column is equal to 0.
  • kurtosis cannot be used with dplyr verbs like mutate, filter as shown in the example below.
# Incorrect usage of kurtosis() with mutate and filter.
df_kurtosis <- df_seq %>% mutate(kurtosis_t = kurtosis(temperature))
df_kurtosis <- df_seq %>% filter(temperature <= kurtosis(temperature))

Skew : skew()

The aggregate function skew measures the skewness of the distribution of a column. Skewness is the third moment of a distribution. It is a measure of the asymmetry of the distribution about its mean compared to the normal (Gaussian) distribution.

  • The normal distribution has a skewness of 0.
  • Positive skewness indicates the distribution having an asymmetric tail extending toward more positive values.
  • Negative skewness indicates the distribution having an asymmetric tail extending toward more negative values.

skew uses one argument named value.expression, which specifies the column for which skewness is to be computed. Use skew(distinct(column_name)) to exclude duplicate rows while calculating skew.

# Calculate the skewness of the 'temperature' column grouped by 'buoyid' column.
df_skew <- df_seq %>% group_by(buoyid) %>% summarise(skew_temp = skew(temperature))

# Print the results.
df_skew %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid skew_temp
#>    <int>     <dbl>
#> 1      0  0.000324
#> 2      1  0       
#> 3      2  0       
#> 4     22 NA       
#> 5     44  0.246

# Exclude duplicates in the same aggregate operation above.
df_skew <- df_seq %>% group_by(buoyid) %>% summarise(skew_temp = skew(distinct(temperature)))

# Print the results.
df_skew %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid skew_temp
#>    <int>     <dbl>
#> 1      0     -1.73
#> 2      1      0   
#> 3      2      0   
#> 4     22     NA   
#> 5     44     -1.99

# Calculate the skewness of the 'temperature' column without any grouping.
df_skew <- df_seq %>% summarise(skew_temp = skew(temperature))

# Print the results.
df_skew
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   skew_temp
#>       <dbl>
#> 1    -0.205

# Exclude duplicates in the same aggregate operation above.
df_skew <- df_seq %>% summarise(skew_temp = skew(distinct(temperature)))

# Print the results.
df_skew
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   skew_temp
#>       <dbl>
#> 1    -0.877

Notes

  • NULLs/NAs are not included in the result computation.
  • The time series aggregate function ts.skew can also be used with regular aggregate operation i.e., the functions ts.skew and skew can be used interchangeably. For more information on time series aggregation function ts.skew, please refer to Time Series Aggregates vignette. However, Teradata recommends to strictly use skew for regular aggregate operation.
  • Following conditions will produce NULL result:
    • Fewer than three non-NULL data points in the data used for the computation.
    • Standard deviation for a column is equal to 0.
  • skew cannot be used with dplyr verbs like mutate, filter as shown in the example below.
# Incorrect usage of skew() with mutate and filter.
df_skew <- df_seq %>% mutate(skew_t = skew(temperature))
df_skew <- df_seq %>% filter(temperature <= skew(temperature))

Sum : sum()

The aggregate function sum returns the sum of values in the column. For more information on sum aggregate function, please refer to Teradata SQL translation vignette. Use sum(distinct(column_name)) to exclude duplicate rows while calculating sum.

# Calculate the sum of the values in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_sum <- df_seq %>% group_by(buoyid) %>% summarise(sum_temp = sum(temperature, na.rm = TRUE))

# Print the results.
df_sum %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid sum_temp
#>    <int>    <int>
#> 1      0      219
#> 2      1      447
#> 3      2      243
#> 4     22       23
#> 5     44      625

# Exclude duplicates in the same aggregate operation above.
df_sum <- df_seq %>% group_by(buoyid) %>% summarise(sum_temp = sum(distinct(temperature)))

# Print the results.
df_sum %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `SUM(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid sum_temp
#>    <int>    <int>
#> 1      0      209
#> 2      1      447
#> 3      2      243
#> 4     22       23
#> 5     44      261

# Calculate the sum of the values in the 'temperature' column without any grouping.
df_sum <- df_seq %>% summarise(sum_temp = sum(temperature))

# Print the results.
df_sum
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   sum_temp
#>      <int>
#> 1     1557

# Exclude duplicates in the same aggregate operation above.
df_sum <- df_seq %>% summarise(sum_temp = sum(distinct(temperature)))

# Print the results.
df_sum
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   sum_temp
#>      <int>
#> 1     1183

The function sum can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'sum_t' which holds the sum of values in the 'temperature' column.
df_sum <- df_seq %>% mutate(sum_t = sum(temperature))

# Print the query.
df_sum %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `SUM(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", SUM("temperature") OVER () AS "sum_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_sum %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates      sum_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <int>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26  1557
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18  1557
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17  1557
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19  1557
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27  1557
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11  1557
#> # ... with more rows

# Filter the rows with temperature values less than one-hundredth of the sum of values in the 'temperature' column.
df_sum <- df_seq %>% filter(temperature < sum(temperature)/100)

# Print the results.
df_sum
#> Warning: `lang_name()` is deprecated as of rlang 0.2.0.
#> Please use `call_name()` instead.
#> This warning is displayed once per session.
#> Warning: `lang()` is deprecated as of rlang 0.2.0.
#> Please use `call2()` instead.
#> This warning is displayed once per session.
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> 2 2014-01-06 08:00:00       26      0       55          10 2016-02-26

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use sum, but not time series aggregate ts.sum, for regular aggregate operations and window aggregate operations.
  • The use of distinct is not permitted in window aggregate functions.

Sample Standard Deviation : sd()

The aggregate function sd returns the sample standard deviation of values of the column. The standard deviation is the second moment of a distribution. For more information on sd aggregate function, please refer to Teradata SQL translation vignette. Use sd(distinct(column_name)) to exclude duplicate rows while calculating sample standard deviation.

# Calculate the sample standard deviation values of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_sd <- df_seq %>% group_by(buoyid) %>% summarise(sd_temp = sd(temperature, na.rm = TRUE))

# Print the results.
df_sd %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid sd_temp
#>    <int>   <dbl>
#> 1      0   51.7 
#> 2      1    3.94
#> 3      2    1   
#> 4     22   NA   
#> 5     44    5.77

# Exclude duplicates in the same aggregate operation above.
df_sd <- df_seq %>% group_by(buoyid) %>% summarise(sd_temp = sd(distinct(temperature)))

# Print the results.
df_sd %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid sd_temp
#>    <int>   <dbl>
#> 1      0   51.7 
#> 2      1    3.94
#> 3      2    1   
#> 4     22   NA   
#> 5     44    5.26

# Calculate the sample standard deviation of the values in the 'temperature' column without any grouping.
df_sd <- df_seq %>% summarise(sd_temp = sd(temperature))

# Print the results.
df_sd
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   sd_temp
#>     <dbl>
#> 1    23.4

# Exclude duplicates in the same aggregate operation above.
df_sd <- df_seq %>% summarise(sd_temp = sd(distinct(temperature)))

# Print the results.
df_sd
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   sd_temp
#>     <dbl>
#> 1    23.6

The function sd can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'sd_t' which holds the sample standard deviation of the values in the 'temperature' column.
df_sd <- df_seq %>% mutate(sd_t = sd(temperature))

# Print the query.
df_sd %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", STDDEV_SAMP("temperature") OVER () AS "sd_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_sd %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates       sd_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <dbl>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26  23.4
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18  23.4
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17  23.4
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19  23.4
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27  23.4
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11  23.4
#> # ... with more rows

# Filter the rows with temperature values less than the sample standard deviation of the values in the 'temperature' column.
df_sd <- df_seq %>% filter(temperature < sd(temperature))

# Print the results.
df_sd
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 10:32:12        1     22       25          23 2014-01-01
#> 2 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> 3 2014-01-06 08:00:00       26      0       55          10 2016-02-26

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use sd for regular aggregate and window operations, instead of time series aggregate function ts.sd.
  • When there are fewer than two non-NULL data points in the sample used for the computation, sd returns NULL/NA.
  • Division by zero results in NULL/NA value rather than an error.
  • If data represents only a sample of the entire population for the column, Teradata recommends to use sd to calculate sample standard deviation instead of sdp which calculates population standard deviation. As the sample size increases, the values for sd and sdp approach the same number.
  • The use of distinct is not permitted in window aggregate functions.

Population Standard Deviation : sdp()

The aggregate function sdp returns the population standard deviation of values of the column. The standard deviation is the second moment of a distribution. For more information on sdp aggregate function, please refer to Teradata SQL translation vignette. Use sdp(distinct(column_name)) to exclude duplicate rows while calculating population standard deviation.

# Calculate the population standard deviation values of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_sdp <- df_seq %>% group_by(buoyid) %>% summarise(sdp_temp = sdp(temperature, na.rm = TRUE))

# Print the results.
df_sdp %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid sdp_temp
#>    <int>    <dbl>
#> 1      0   44.8  
#> 2      1    3.59 
#> 3      2    0.816
#> 4     22    0    
#> 5     44    5.54

# Exclude duplicates in the same aggregate operation above.
df_sdp <- df_seq %>% group_by(buoyid) %>% summarise(sdp_temp = sdp(distinct(temperature)))

# Print the results.
df_sdp %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid sdp_temp
#>    <int>    <dbl>
#> 1      0   42.2  
#> 2      1    3.59 
#> 3      2    0.816
#> 4     22    0    
#> 5     44    4.71

# Calculate the population standard deviation of the values in the 'temperature' column without any grouping.
df_sdp <- df_seq %>% summarise(sdp_temp = sdp(temperature))

# Print the results.
df_sdp
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   sdp_temp
#>      <dbl>
#> 1     23.0

# Exclude duplicates in the same aggregate operation above.
df_sdp <- df_seq %>% summarise(sdp_temp = sdp(distinct(temperature)))

# Print the results.
df_sdp
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   sdp_temp
#>      <dbl>
#> 1     22.9

The function sdp can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'sd_t' which holds the population standard deviation of the values in the 'temperature' column.
df_sdp <- df_seq %>% mutate(sd_t = sdp(temperature))

# Print the query.
df_sdp %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", STDDEV_POP("temperature") OVER () AS "sd_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_sdp %>% arrange(TD_TIMECODE, TD_SEQNO)
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates       sd_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <dbl>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26  23.0
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18  23.0
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17  23.0
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19  23.0
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27  23.0
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11  23.0
#> # ... with more rows

# Filter the rows with temperature values less than the population standard deviation of the values in the 'temperature' column.
df_sdp <- df_seq %>% filter(temperature < sdp(temperature))

# Print the results.
df_sdp
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> 2 2014-01-06 08:00:00       26      0       55          10 2016-02-26

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use sdp for regular aggregate and window operations, instead of time series aggregate function ts.sdp.
  • When there are fewer than two non-NULL data points in the population used for the computation, sdp returns NULL/NA.
  • Division by zero results in NULL/NA value rather than an error.
  • sdp can only be used if data represents entire population. Otherwise, Teradata recommends to use sd to calculate sample standard deviation.
  • The use of distinct is not permitted in window aggregate functions.

Minimum : min()

The aggregate function min returns the minimum value in the column. For more information on min aggregate function, please refer to Teradata SQL translation vignette. Use min(distinct(column_name)) to exclude duplicate rows while calculating minimum value.

# Calculate the minimum value in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_min <- df_seq %>% group_by(buoyid) %>% summarise(min_temp = min(temperature, na.rm = TRUE))

# Print the results.
df_min %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid min_temp
#>    <int>    <int>
#> 1      0       10
#> 2      1       70
#> 3      2       80
#> 4     22       23
#> 5     44       43

# Exclude duplicates in the same aggregate operation above.
df_min <- df_seq %>% group_by(buoyid) %>% summarise(min_temp = min(distinct(temperature)))

# Print the results.
df_min %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `MIN(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid min_temp
#>    <int>    <int>
#> 1      0       10
#> 2      1       70
#> 3      2       80
#> 4     22       23
#> 5     44       43

# Calculate the minimum value in the 'temperature' column without any grouping.
df_min <- df_seq %>% summarise(min_temp = min(temperature))

# Print the results.
df_min
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   min_temp
#>      <int>
#> 1       10

# Exclude duplicates in the same aggregate operation above.
df_min <- df_seq %>% summarise(min_temp = min(distinct(temperature)))

# Print the results.
df_min
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   min_temp
#>      <int>
#> 1       10

The function min can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'min_t' which holds the minimum value in the 'temperature' column.
df_min <- df_seq %>% mutate(min_t = min(temperature))

# Print the query.
df_min %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `MIN(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", MIN("temperature") OVER () AS "min_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_min %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates      min_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <int>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26    10
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18    10
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17    10
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19    10
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27    10
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11    10
#> # ... with more rows

# Filter the rows with one-tenth of temperature values greater than or equal to the minimum value in 'temperature' column.
df_min <- df_seq %>% filter(temperature/10 >= min(temperature))

# Print the results.
df_min
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 08:10:00       27      0       55         100 2016-03-27

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use min, but not time series aggregate ts.min, for regular aggregate operations and window aggregate operations.
  • The use of distinct is not permitted in window aggregate functions.

Maximum : max()

The aggregate function max returns the maximum value in the column. For more information on max aggregate function, please refer to Teradata SQL translation vignette. Use max(distinct(column_name)) to exclude duplicate rows while calculating maximum value.

# Calculate the maximum value in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_max <- df_seq %>% group_by(buoyid) %>% summarise(max_temp = max(temperature, na.rm = TRUE))

# Print the results.
df_max %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid max_temp
#>    <int>    <int>
#> 1      0      100
#> 2      1       79
#> 3      2       82
#> 4     22       23
#> 5     44       56

# Exclude duplicates in the same aggregate operation above.
df_max <- df_seq %>% group_by(buoyid) %>% summarise(max_temp = max(distinct(temperature)))

# Print the results.
df_max %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid max_temp
#>    <int>    <int>
#> 1      0      100
#> 2      1       79
#> 3      2       82
#> 4     22       23
#> 5     44       56

# Calculate the maximum value in the 'temperature' column without any grouping.
df_max <- df_seq %>% summarise(max_temp = max(temperature))

# Print the results.
df_max
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   max_temp
#>      <int>
#> 1      100

# Exclude duplicates in the same aggregate operation above.
df_max <- df_seq %>% summarise(max_temp = max(distinct(temperature)))

# Print the results.
df_max
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   max_temp
#>      <int>
#> 1      100

The function max can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'max_t' which holds the maximum value in the 'temperature' column.
df_max <- df_seq %>% mutate(max_t = max(temperature))

# Print the query.
df_max %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", MAX("temperature") OVER () AS "max_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_max %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates      max_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <int>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26   100
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18   100
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17   100
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19   100
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27   100
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11   100
#> # ... with more rows

# Filter the rows with temperature values less than or equal to one-tenth of the maximum value in 'temperature' column.
df_max <- df_seq %>% filter(temperature <= max(temperature)/10)

# Print the results.
df_max
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> 2 2014-01-06 08:00:00       26      0       55          10 2016-02-26

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use max, but not time series aggregate ts.max, for regular aggregate operations and window aggregate operations.
  • The use of distinct is not permitted in window aggregate functions.

Sample Variance : var()

The aggregate function var returns the sample variance of values of the column. The variance of a sample is a measure of dispersion from the mean of that sample. It is the square of the sample standard deviation. For more information on var aggregate function, please refer to Teradata SQL translation vignette. Use var(distinct(column_name)) to exclude duplicate rows while calculating sample variance.

# Calculate the sample variance of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_var <- df_seq %>% group_by(buoyid) %>% summarise(var_temp = var(temperature, na.rm = TRUE))

# Print the results.
df_var %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid var_temp
#>    <int>    <dbl>
#> 1      0   2670. 
#> 2      1     15.5
#> 3      2      1  
#> 4     22     NA  
#> 5     44     33.2

# Exclude duplicates in the same aggregate operation above.
df_var <- df_seq %>% group_by(buoyid) %>% summarise(var_temp = var(distinct(temperature)))

# Print the results.
df_var %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid var_temp
#>    <int>    <dbl>
#> 1      0   2670. 
#> 2      1     15.5
#> 3      2      1  
#> 4     22     NA  
#> 5     44     27.7

# Calculate the sample variance of the values in the 'temperature' column without any grouping.
df_var <- df_seq %>% summarise(var_temp = var(temperature))

# Print the results.
df_var
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   var_temp
#>      <dbl>
#> 1     548.

# Exclude duplicates in the same aggregate operation above.
df_var <- df_seq %>% summarise(var_temp = var(distinct(temperature)))

# Print the results.
df_var
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   var_temp
#>      <dbl>
#> 1     558.

The function var can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'var_t' which holds the sample variance of the values in the 'temperature' column.
df_var <- df_seq %>% mutate(var_t = var(temperature))

# Print the query.
df_var %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", VAR_SAMP("temperature") OVER () AS "var_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_var %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates      var_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <dbl>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26  548.
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18  548.
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17  548.
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19  548.
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27  548.
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11  548.
#> # ... with more rows

# Filter the rows with temperature values less than one-tenth of the sample variance of the values in the 'temperature' column.
df_var <- df_seq %>% filter(temperature < var(temperature)/10)

# Print the results.
df_var
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 10:00:24        4     44       55          43 2014-04-04
#> 2 2014-01-06 10:00:25        6     44       55          43 2014-06-06
#> 3 2014-01-06 10:00:26        7     44       55          43 2014-07-07
#> 4 2014-01-06 10:01:25        8     44       55          53 2014-08-08
#> 5 2014-01-06 10:00:24        5     44       55          43 2014-05-05
#> 6 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> # ... with more rows

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use var for regular aggregate and window operations, instead of time series aggregate function ts.var.
  • When there are fewer than two non-NULL data points in the sample used for the computation, var returns NULL/NA.
  • Division by zero results in NULL/NA value rather than an error.
  • If data represents only a sample of the entire population for the column, Teradata recommends to use var to calculate sample variance instead of varp which calculates population variance. As the sample size increases, the values for var and varp approach the same number.
  • The use of distinct is not permitted in window aggregate functions.

Population Variance : varp()

The aggregate function varp returns the population variance of values of the column. The variance of a population is a measure of dispersion from the mean of that population. For more information on varp aggregate function, please refer to Teradata SQL translation vignette. Use varp(distinct(column_name)) to exclude duplicate rows while calculating population variance.

# Calculate the population variance of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_varp <- df_seq %>% group_by(buoyid) %>% summarise(varp_temp = varp(temperature, na.rm = TRUE))

# Print the results.
df_varp %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid varp_temp
#>    <int>     <dbl>
#> 1      0  2003.   
#> 2      1    12.9  
#> 3      2     0.667
#> 4     22     0    
#> 5     44    30.7

# Exclude duplicates in the same aggregate operation above.
df_varp <- df_seq %>% group_by(buoyid) %>% summarise(varp_temp = varp(distinct(temperature)))

# Print the results.
df_varp %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid varp_temp
#>    <int>     <dbl>
#> 1      0  1780.   
#> 2      1    12.9  
#> 3      2     0.667
#> 4     22     0    
#> 5     44    22.2

# Calculate the population variance of the values in the 'temperature' column without any grouping.
df_varp <- df_seq %>% summarise(varp_temp = varp(temperature))

# Print the results.
df_varp
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   varp_temp
#>       <dbl>
#> 1      528.

# Exclude duplicates in the same aggregate operation above.
df_varp <- df_seq %>% summarise(varp_temp = varp(distinct(temperature)))

# Print the results.
df_varp
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   varp_temp
#>       <dbl>
#> 1      527.

The function varp can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'var_t' which holds the population variance of the values in the 'temperature' column.
df_varp <- df_seq %>% mutate(var_t = varp(temperature))

# Print the query.
df_varp %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", VAR_POP("temperature") OVER () AS "var_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_varp %>% arrange(TD_TIMECODE, TD_SEQNO)
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates      var_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <dbl>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26  528.
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18  528.
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17  528.
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19  528.
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27  528.
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11  528.
#> # ... with more rows

# Filter the rows with temperature values less than one-tenth of the population variance of the values in the 'temperature' column.
df_varp <- df_seq %>% filter(temperature < varp(temperature)/10)

# Print the results.
df_varp
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 10:00:24        4     44       55          43 2014-04-04
#> 2 2014-01-06 10:00:25        6     44       55          43 2014-06-06
#> 3 2014-01-06 10:00:26        7     44       55          43 2014-07-07
#> 4 2014-01-06 10:12:00        3     44       55          43 2014-03-03
#> 5 2014-01-06 10:00:24        5     44       55          43 2014-05-05
#> 6 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> # ... with more rows

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use varp for regular aggregate and window operations, instead of time series aggregate function ts.varp.
  • When there are fewer than two non-NULL data points in the population used for the computation, varp returns NULL/NA.
  • Division by zero results in NULL/NA value rather than an error.
  • varp can only be used if data represents entire population. Otherwise, Teradata recommends to use var to calculate sample variance.
  • The use of distinct is not permitted in window aggregate functions.

Average : mean()

The aggregate function mean returns the average value in the column. For more information on mean aggregate function, please refer to Teradata SQL translation vignette. Use mean(distinct(column_name)) to exclude duplicate rows while calculating average.

# Calculate the average value in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_avg <- df_seq %>% group_by(buoyid) %>% summarise(avg_temp = mean(temperature, na.rm = TRUE))

# Print the results.
df_avg %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid avg_temp
#>    <int>    <dbl>
#> 1      0     54.8
#> 2      1     74.5
#> 3      2     81  
#> 4     22     23  
#> 5     44     48.1

# Exclude duplicates in the same aggregate operation above.
df_avg <- df_seq %>% group_by(buoyid) %>% summarise(avg_temp = mean(distinct(temperature)))

# Print the results.
df_avg %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `mean(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid avg_temp
#>    <int>    <dbl>
#> 1      0     69.7
#> 2      1     74.5
#> 3      2     81  
#> 4     22     23  
#> 5     44     52.2

# Calculate the average value in the 'temperature' column without any grouping.
df_avg <- df_seq %>% summarise(avg_temp = mean(temperature))

# Print the results.
df_avg
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   avg_temp
#>      <dbl>
#> 1     57.7

# Exclude duplicates in the same aggregate operation above.
df_avg <- df_seq %>% summarise(avg_temp = mean(distinct(temperature)))

# Print the results.
df_avg
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   avg_temp
#>      <dbl>
#> 1     65.7

The function mean can be used as window aggregates using dplyr verbs like mutate, filter.

# Create a new column 'avg_t' which holds the average value in the 'temperature' column.
df_avg <- df_seq %>% mutate(avg_t = mean(temperature))

# Print the query.
df_avg %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", AVG("temperature") OVER () AS "avg_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_avg %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates      avg_t
#>   <dttm>                 <int>  <int>    <int>       <int> <date>     <dbl>
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26  57.7
#> 2 2014-01-06 08:08:59       18      0       55          NA 2015-06-18  57.7
#> 3 2014-01-06 08:09:59       17      0       55          99 2015-05-17  57.7
#> 4 2014-01-06 08:10:00       19      0       55          10 2015-07-19  57.7
#> 5 2014-01-06 08:10:00       27      0       55         100 2016-03-27  57.7
#> 6 2014-01-06 09:01:25       11      1       55          70 2014-11-11  57.7
#> # ... with more rows

# Filter the rows with temperature values less than one-fifth of the average value in 'temperature' column.
df_avg <- df_seq %>% filter(temperature < mean(temperature)/5)

# Print the results.
df_avg
#> # Source:   lazy query [?? x 6]
#> # Database: Teradata
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> 2 2014-01-06 08:00:00       26      0       55          10 2016-02-26

Notes

  • NULLs/NAs are not included in the result computation.
  • Teradata recommends to strictly use mean, but not time series aggregate ts.mean, for regular aggregate operations and window aggregate operations.
  • The use of distinct is not permitted in window aggregate functions.

Count : n() and n_distinct()

The aggregate function n returns the total number of qualified rows in the column. n uses one argument named x, which specifies the column for which total number of rows is to be computed. For more information on n aggregate function, please refer to Teradata SQL translation vignette. Use n(distinct(column_name)) to exclude duplicate rows while calculating the number of rows. Another way of accessing distinct rows is n_distinct(column_name).

# Calculate the number of rows in the 'temperature' column grouped by 'buoyid' column.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n(temperature))

# Print the results.
df_count %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid count_temp
#>    <int> <int64>   
#> 1      0  4        
#> 2      1  6        
#> 3      2  3        
#> 4     22  1        
#> 5     44 13

# Exclude duplicates in the same aggregate operation above.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n(distinct(temperature)))

# Print the results.
df_count %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid count_temp
#>    <int> <int64>   
#> 1      0 3         
#> 2      1 6         
#> 3      2 3         
#> 4     22 1         
#> 5     44 5

# Another way of excluding duplicates in the same aggregate operation above.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n_distinct(temperature))

# Print the results.
df_count %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid count_temp
#>    <int> <int64>   
#> 1      0 3         
#> 2      1 6         
#> 3      2 3         
#> 4     22 1         
#> 5     44 5

# Calculate the number of rows in the table grouped by 'buoyid' column.
# Note that the argument is not provided for the function `n()`.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n())

# Print the results.
df_count %>% arrange(buoyid)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: buoyid
#>   buoyid count_temp
#>    <int> <int64>   
#> 1      0  5        
#> 2      1  6        
#> 3      2  3        
#> 4     22  1        
#> 5     44 13

# Calculate the number of rows in the 'temperature' column without any grouping.
df_count <- df_seq %>% summarise(count_temp = n(temperature))

# Print the results.
df_count
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   count_temp
#>   <int64>   
#> 1 27

# Exclude duplicates in the same aggregate operation above.
df_count <- df_seq %>% summarise(count_temp = n(distinct(temperature)))

# Print the results.
df_count
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   count_temp
#>   <int64>   
#> 1 18

# Calculate the number of rows in the table without any grouping.
# Note that the argument is not provided for the function `n()`.
df_count <- df_seq %>% summarise(count_temp = n())

# Print the results.
df_count
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   count_temp
#>   <int64>   
#> 1 28

The function n can be used as window aggregates using dplyr verbs like mutate, filter. However, the n_distinct aggregate function cannot be used with

# Create a new column 'count_t' which holds the number of rows in the 'temperature' column.
df_count <- df_seq %>% mutate(count_t = n(temperature))

# Print the query.
df_count %>% show_query()
#> <SQL>
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", CAST(COUNT("temperature") OVER () AS BIGINT) AS "count_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_count %>% select(TD_TIMECODE, TD_SEQNO, buoyid, count_t) %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 4]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid count_t
#>   <dttm>                 <int>  <int> <int64>
#> 1 2014-01-06 08:00:00       26      0 27     
#> 2 2014-01-06 08:08:59       18      0 27     
#> 3 2014-01-06 08:09:59       17      0 27     
#> 4 2014-01-06 08:10:00       19      0 27     
#> 5 2014-01-06 08:10:00       27      0 27     
#> 6 2014-01-06 09:01:25       11      1 27     
#> # ... with more rows

# Create a new column 'count_t' which holds the number of rows in the table.
# Note that the argument is not provided for the function `n()`.
df_count <- df_seq %>% mutate(count_t = n())

# Print the query.
df_count %>% show_query()
#> <SQL>
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", CAST(COUNT(*) OVER () AS BIGINT) AS "count_t"
#> FROM "ocean_buoys_seq"

# Print the results.
df_count %>% select(TD_TIMECODE, TD_SEQNO, buoyid, count_t) %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 4]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid count_t
#>   <dttm>                 <int>  <int> <int64>
#> 1 2014-01-06 08:00:00       26      0 28     
#> 2 2014-01-06 08:08:59       18      0 28     
#> 3 2014-01-06 08:09:59       17      0 28     
#> 4 2014-01-06 08:10:00       19      0 28     
#> 5 2014-01-06 08:10:00       27      0 28     
#> 6 2014-01-06 09:01:25       11      1 28     
#> # ... with more rows

# Filter the rows with temperature values less than the number of rows in 'temperature' column.
df_count <- df_seq %>% filter(temperature < n(temperature))

# Print the results.
df_count %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source:     lazy query [?? x 6]
#> # Database:   Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#>   TD_TIMECODE         TD_SEQNO buoyid salinity temperature dates     
#>   <dttm>                 <int>  <int>    <int>       <int> <date>    
#> 1 2014-01-06 08:00:00       26      0       55          10 2016-02-26
#> 2 2014-01-06 08:10:00       19      0       55          10 2015-07-19
#> 3 2014-01-06 10:32:12        1     22       25          23 2014-01-01

Notes

  • NULLs/NAs are not included in the result computation.
  • n and n_distinct are valid on all columns with any data type.
  • Teradata recommends to strictly use n, but not time series aggregate ts.n, for regular aggregate operations and window aggregate operations.
  • The use of distinct is not permitted in window aggregate functions i.e., n(distinct(column_name)) and n_distinct(column_name) cannot be used as window aggregate operations.