Introduction to Regular Aggregates
The regular aggregate functions help perform aggregate operations on non-time series data. The goal of this vignette is to provide introductory examples to get the user familiar with regular aggregate functions.
Each aggregate function has a section in this vignette. Each section can be read in isolation, except where references are made to other sections or other vignettes. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using the aggregate functions with Teradata Vantage.
Once the connection is established using td_create_context, you can load the example table ocean_buoys_seq
and create respective tibble using the following statements:
loadExampleData("time_series_example", "ocean_buoys_seq")
# Create tbl_teradata objects.
df_seq <- tbl(con, "ocean_buoys_seq")
Median : median()
The aggregate function median
returns the median of all values in each group. The function returns the average of the two middle values if the argument value.expression
contains an even number of values. median
uses one argument named value.expression
, which specifies the column for which median is to be computed.
# Calculate the median value of the 'temperature' column grouped by 'buoyid' column.
df_median <- df_seq %>% group_by(buoyid) %>% summarise(median_temp = median(temperature))
# Print the results.
df_median %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid median_temp
#> <int> <int>
#> 1 0 54
#> 2 1 74
#> 3 2 81
#> 4 22 23
#> 5 44 43
# Calculate the median value of the 'temperature' column without any grouping.
df_median <- df_seq %>% summarise(median_temp = median(temperature))
# Print the results.
df_median
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> median_temp
#> <int>
#> 1 54
Notes
- The column containing median values is of NUMERIC data type.
- NULLs/NAs are not included in the result computation.
median
is valid only for numeric data.- Use of
distinct
to exclude duplicates is not allowed formedian
function. This will throw an error as shown in the example 1 below. However, duplicate values can be excluded in the computation for time series aggregate functionts.median
. For more information on time series aggregation functionts.median
, please refer to Time Series Aggregates vignettes. - Teradata recommends to strictly use the function
median(value.expression)
for regular aggregate operation with/without grouping. - The function
median
cannot be used with dplyr verbs likemutate
,filter
as shown in the example 2 below.
# Example 1
# Using 'distinct()' to calculate median on distinct temperature values.
df_median <- df_seq %>% group_by(buoyid) %>% summarise(median_temp = median(distinct(temperature)))
# Print the results.
# This throws an error "Illegal use of DISTINCT Aggregate Expressions."
df_median
# Example 2
# Incorrect usage of median() with mutate and filter.
df_median <- df_seq %>% mutate(median_t = median(temperature))
df_median <- df_seq %>% filter(temperature <= median(temperature))
Kurtosis : kurtosis()
The aggregate function kurtosis
measures the tailedness of the probability distribution of a column. Kurtosis is the fourth moment of the distribution of the standardized (z) values. It is a measure of the outlier (rare, extreme observation) character of the distribution as compared to the normal (Gaussian) distribution.
- The normal distribution has a kurtosis of 0.
- Positive kurtosis indicates that the distribution is more outlier-prone (deviation from the mean) than the normal distribution.
- Negative kurtosis indicates that the distribution is less outlier-prone (deviation from the mean) than the normal distribution.
kurtosis
uses one argument named value.expression
, which specifies the column for which kurtosis is to be computed. Use kurtosis(distinct(column_name))
to exclude duplicate rows while calculating kurtosis.
# Calculate the kurtosis value of the 'temperature' column grouped by 'buoyid' column.
df_kurtosis <- df_seq %>% group_by(buoyid) %>% summarise(kurtosis_temp = kurtosis(temperature))
# Print the results.
df_kurtosis %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid kurtosis_temp
#> <int> <dbl>
#> 1 0 -6.00
#> 2 1 -2.76
#> 3 2 NA
#> 4 22 NA
#> 5 44 -2.20
# Exclude duplicates in the same aggregate operation above.
df_kurtosis <- df_seq %>% group_by(buoyid) %>% summarise(kurtosis_temp = kurtosis(distinct(temperature)))
# Print the results.
df_kurtosis %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid kurtosis_temp
#> <int> <dbl>
#> 1 0 NA
#> 2 1 -2.76
#> 3 2 NA
#> 4 22 NA
#> 5 44 4.13
# Calculate the kurtosis value of the 'temperature' column without any grouping.
df_kurtosis <- df_seq %>% summarise(kurtosis_temp = kurtosis(temperature))
# Print the results.
df_kurtosis
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> kurtosis_temp
#> <dbl>
#> 1 -0.198
# Exclude duplicates in the same aggregate operation above.
df_kurtosis <- df_seq %>% summarise(kurtosis_temp = kurtosis(distinct(temperature)))
# Print the results.
df_kurtosis
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> kurtosis_temp
#> <dbl>
#> 1 0.742
Notes
- NULLs/NAs are not included in the result computation.
- The time series aggregate function
ts.kurtosis
can also be used with regular aggregate operation i.e., the functionsts.kurtosis
andkurtosis
can be used interchangeably. For more information on time series aggregation functionts.kurtosis
, please refer to Time Series Aggregates vignette. However, Teradata recommends to strictly usekurtosis
for regular aggregate operation. - Following conditions will produce NULL result:
- Fewer than three non-NULL data points in the data used for the computation.
- Standard deviation for a column is equal to 0.
kurtosis
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of kurtosis() with mutate and filter.
df_kurtosis <- df_seq %>% mutate(kurtosis_t = kurtosis(temperature))
df_kurtosis <- df_seq %>% filter(temperature <= kurtosis(temperature))
Skew : skew()
The aggregate function skew
measures the skewness of the distribution of a column. Skewness is the third moment of a distribution. It is a measure of the asymmetry of the distribution about its mean compared to the normal (Gaussian) distribution.
- The normal distribution has a skewness of 0.
- Positive skewness indicates the distribution having an asymmetric tail extending toward more positive values.
- Negative skewness indicates the distribution having an asymmetric tail extending toward more negative values.
skew
uses one argument named value.expression
, which specifies the column for which skewness is to be computed. Use skew(distinct(column_name))
to exclude duplicate rows while calculating skew.
# Calculate the skewness of the 'temperature' column grouped by 'buoyid' column.
df_skew <- df_seq %>% group_by(buoyid) %>% summarise(skew_temp = skew(temperature))
# Print the results.
df_skew %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid skew_temp
#> <int> <dbl>
#> 1 0 0.000324
#> 2 1 0
#> 3 2 0
#> 4 22 NA
#> 5 44 0.246
# Exclude duplicates in the same aggregate operation above.
df_skew <- df_seq %>% group_by(buoyid) %>% summarise(skew_temp = skew(distinct(temperature)))
# Print the results.
df_skew %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid skew_temp
#> <int> <dbl>
#> 1 0 -1.73
#> 2 1 0
#> 3 2 0
#> 4 22 NA
#> 5 44 -1.99
# Calculate the skewness of the 'temperature' column without any grouping.
df_skew <- df_seq %>% summarise(skew_temp = skew(temperature))
# Print the results.
df_skew
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> skew_temp
#> <dbl>
#> 1 -0.205
# Exclude duplicates in the same aggregate operation above.
df_skew <- df_seq %>% summarise(skew_temp = skew(distinct(temperature)))
# Print the results.
df_skew
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> skew_temp
#> <dbl>
#> 1 -0.877
Notes
- NULLs/NAs are not included in the result computation.
- The time series aggregate function
ts.skew
can also be used with regular aggregate operation i.e., the functionsts.skew
andskew
can be used interchangeably. For more information on time series aggregation functionts.skew
, please refer to Time Series Aggregates vignette. However, Teradata recommends to strictly useskew
for regular aggregate operation. - Following conditions will produce NULL result:
- Fewer than three non-NULL data points in the data used for the computation.
- Standard deviation for a column is equal to 0.
skew
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of skew() with mutate and filter.
df_skew <- df_seq %>% mutate(skew_t = skew(temperature))
df_skew <- df_seq %>% filter(temperature <= skew(temperature))
Sum : sum()
The aggregate function sum
returns the sum of values in the column. For more information on sum
aggregate function, please refer to Teradata SQL translation vignette. Use sum(distinct(column_name))
to exclude duplicate rows while calculating sum.
# Calculate the sum of the values in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_sum <- df_seq %>% group_by(buoyid) %>% summarise(sum_temp = sum(temperature, na.rm = TRUE))
# Print the results.
df_sum %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid sum_temp
#> <int> <int>
#> 1 0 219
#> 2 1 447
#> 3 2 243
#> 4 22 23
#> 5 44 625
# Exclude duplicates in the same aggregate operation above.
df_sum <- df_seq %>% group_by(buoyid) %>% summarise(sum_temp = sum(distinct(temperature)))
# Print the results.
df_sum %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `SUM(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid sum_temp
#> <int> <int>
#> 1 0 209
#> 2 1 447
#> 3 2 243
#> 4 22 23
#> 5 44 261
# Calculate the sum of the values in the 'temperature' column without any grouping.
df_sum <- df_seq %>% summarise(sum_temp = sum(temperature))
# Print the results.
df_sum
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> sum_temp
#> <int>
#> 1 1557
# Exclude duplicates in the same aggregate operation above.
df_sum <- df_seq %>% summarise(sum_temp = sum(distinct(temperature)))
# Print the results.
df_sum
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> sum_temp
#> <int>
#> 1 1183
The function sum
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'sum_t' which holds the sum of values in the 'temperature' column.
df_sum <- df_seq %>% mutate(sum_t = sum(temperature))
# Print the query.
df_sum %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `SUM(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", SUM("temperature") OVER () AS "sum_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_sum %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates sum_t
#> <dttm> <int> <int> <int> <int> <date> <int>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 1557
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 1557
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 1557
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 1557
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 1557
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 1557
#> # ... with more rows
# Filter the rows with temperature values less than one-hundredth of the sum of values in the 'temperature' column.
df_sum <- df_seq %>% filter(temperature < sum(temperature)/100)
# Print the results.
df_sum
#> Warning: `lang_name()` is deprecated as of rlang 0.2.0.
#> Please use `call_name()` instead.
#> This warning is displayed once per session.
#> Warning: `lang()` is deprecated as of rlang 0.2.0.
#> Please use `call2()` instead.
#> This warning is displayed once per session.
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> 2 2014-01-06 08:00:00 26 0 55 10 2016-02-26
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
sum
, but not time series aggregatets.sum
, for regular aggregate operations and window aggregate operations. - The use of
distinct
is not permitted in window aggregate functions.
Sample Standard Deviation : sd()
The aggregate function sd
returns the sample standard deviation of values of the column. The standard deviation is the second moment of a distribution. For more information on sd
aggregate function, please refer to Teradata SQL translation vignette. Use sd(distinct(column_name))
to exclude duplicate rows while calculating sample standard deviation.
# Calculate the sample standard deviation values of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_sd <- df_seq %>% group_by(buoyid) %>% summarise(sd_temp = sd(temperature, na.rm = TRUE))
# Print the results.
df_sd %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid sd_temp
#> <int> <dbl>
#> 1 0 51.7
#> 2 1 3.94
#> 3 2 1
#> 4 22 NA
#> 5 44 5.77
# Exclude duplicates in the same aggregate operation above.
df_sd <- df_seq %>% group_by(buoyid) %>% summarise(sd_temp = sd(distinct(temperature)))
# Print the results.
df_sd %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid sd_temp
#> <int> <dbl>
#> 1 0 51.7
#> 2 1 3.94
#> 3 2 1
#> 4 22 NA
#> 5 44 5.26
# Calculate the sample standard deviation of the values in the 'temperature' column without any grouping.
df_sd <- df_seq %>% summarise(sd_temp = sd(temperature))
# Print the results.
df_sd
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> sd_temp
#> <dbl>
#> 1 23.4
# Exclude duplicates in the same aggregate operation above.
df_sd <- df_seq %>% summarise(sd_temp = sd(distinct(temperature)))
# Print the results.
df_sd
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> sd_temp
#> <dbl>
#> 1 23.6
The function sd
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'sd_t' which holds the sample standard deviation of the values in the 'temperature' column.
df_sd <- df_seq %>% mutate(sd_t = sd(temperature))
# Print the query.
df_sd %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", STDDEV_SAMP("temperature") OVER () AS "sd_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_sd %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates sd_t
#> <dttm> <int> <int> <int> <int> <date> <dbl>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 23.4
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 23.4
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 23.4
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 23.4
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 23.4
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 23.4
#> # ... with more rows
# Filter the rows with temperature values less than the sample standard deviation of the values in the 'temperature' column.
df_sd <- df_seq %>% filter(temperature < sd(temperature))
# Print the results.
df_sd
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 10:32:12 1 22 25 23 2014-01-01
#> 2 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> 3 2014-01-06 08:00:00 26 0 55 10 2016-02-26
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
sd
for regular aggregate and window operations, instead of time series aggregate functionts.sd
. - When there are fewer than two non-NULL data points in the sample used for the computation,
sd
returns NULL/NA. - Division by zero results in NULL/NA value rather than an error.
- If data represents only a sample of the entire population for the column, Teradata recommends to use
sd
to calculate sample standard deviation instead ofsdp
which calculates population standard deviation. As the sample size increases, the values forsd
andsdp
approach the same number. - The use of
distinct
is not permitted in window aggregate functions.
Population Standard Deviation : sdp()
The aggregate function sdp
returns the population standard deviation of values of the column. The standard deviation is the second moment of a distribution. For more information on sdp
aggregate function, please refer to Teradata SQL translation vignette. Use sdp(distinct(column_name))
to exclude duplicate rows while calculating population standard deviation.
# Calculate the population standard deviation values of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_sdp <- df_seq %>% group_by(buoyid) %>% summarise(sdp_temp = sdp(temperature, na.rm = TRUE))
# Print the results.
df_sdp %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid sdp_temp
#> <int> <dbl>
#> 1 0 44.8
#> 2 1 3.59
#> 3 2 0.816
#> 4 22 0
#> 5 44 5.54
# Exclude duplicates in the same aggregate operation above.
df_sdp <- df_seq %>% group_by(buoyid) %>% summarise(sdp_temp = sdp(distinct(temperature)))
# Print the results.
df_sdp %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid sdp_temp
#> <int> <dbl>
#> 1 0 42.2
#> 2 1 3.59
#> 3 2 0.816
#> 4 22 0
#> 5 44 4.71
# Calculate the population standard deviation of the values in the 'temperature' column without any grouping.
df_sdp <- df_seq %>% summarise(sdp_temp = sdp(temperature))
# Print the results.
df_sdp
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> sdp_temp
#> <dbl>
#> 1 23.0
# Exclude duplicates in the same aggregate operation above.
df_sdp <- df_seq %>% summarise(sdp_temp = sdp(distinct(temperature)))
# Print the results.
df_sdp
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> sdp_temp
#> <dbl>
#> 1 22.9
The function sdp
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'sd_t' which holds the population standard deviation of the values in the 'temperature' column.
df_sdp <- df_seq %>% mutate(sd_t = sdp(temperature))
# Print the query.
df_sdp %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", STDDEV_POP("temperature") OVER () AS "sd_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_sdp %>% arrange(TD_TIMECODE, TD_SEQNO)
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates sd_t
#> <dttm> <int> <int> <int> <int> <date> <dbl>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 23.0
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 23.0
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 23.0
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 23.0
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 23.0
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 23.0
#> # ... with more rows
# Filter the rows with temperature values less than the population standard deviation of the values in the 'temperature' column.
df_sdp <- df_seq %>% filter(temperature < sdp(temperature))
# Print the results.
df_sdp
#> Warning: Missing values are always removed in SQL.
#> Use `STDDEV_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> 2 2014-01-06 08:00:00 26 0 55 10 2016-02-26
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
sdp
for regular aggregate and window operations, instead of time series aggregate functionts.sdp
. - When there are fewer than two non-NULL data points in the population used for the computation,
sdp
returns NULL/NA. - Division by zero results in NULL/NA value rather than an error.
sdp
can only be used if data represents entire population. Otherwise, Teradata recommends to usesd
to calculate sample standard deviation.- The use of
distinct
is not permitted in window aggregate functions.
Minimum : min()
The aggregate function min
returns the minimum value in the column. For more information on min
aggregate function, please refer to Teradata SQL translation vignette. Use min(distinct(column_name))
to exclude duplicate rows while calculating minimum value.
# Calculate the minimum value in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_min <- df_seq %>% group_by(buoyid) %>% summarise(min_temp = min(temperature, na.rm = TRUE))
# Print the results.
df_min %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid min_temp
#> <int> <int>
#> 1 0 10
#> 2 1 70
#> 3 2 80
#> 4 22 23
#> 5 44 43
# Exclude duplicates in the same aggregate operation above.
df_min <- df_seq %>% group_by(buoyid) %>% summarise(min_temp = min(distinct(temperature)))
# Print the results.
df_min %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `MIN(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid min_temp
#> <int> <int>
#> 1 0 10
#> 2 1 70
#> 3 2 80
#> 4 22 23
#> 5 44 43
# Calculate the minimum value in the 'temperature' column without any grouping.
df_min <- df_seq %>% summarise(min_temp = min(temperature))
# Print the results.
df_min
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> min_temp
#> <int>
#> 1 10
# Exclude duplicates in the same aggregate operation above.
df_min <- df_seq %>% summarise(min_temp = min(distinct(temperature)))
# Print the results.
df_min
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> min_temp
#> <int>
#> 1 10
The function min
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'min_t' which holds the minimum value in the 'temperature' column.
df_min <- df_seq %>% mutate(min_t = min(temperature))
# Print the query.
df_min %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `MIN(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", MIN("temperature") OVER () AS "min_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_min %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates min_t
#> <dttm> <int> <int> <int> <int> <date> <int>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 10
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 10
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 10
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 10
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 10
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 10
#> # ... with more rows
# Filter the rows with one-tenth of temperature values greater than or equal to the minimum value in 'temperature' column.
df_min <- df_seq %>% filter(temperature/10 >= min(temperature))
# Print the results.
df_min
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 08:10:00 27 0 55 100 2016-03-27
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
min
, but not time series aggregatets.min
, for regular aggregate operations and window aggregate operations. - The use of
distinct
is not permitted in window aggregate functions.
Maximum : max()
The aggregate function max
returns the maximum value in the column. For more information on max
aggregate function, please refer to Teradata SQL translation vignette. Use max(distinct(column_name))
to exclude duplicate rows while calculating maximum value.
# Calculate the maximum value in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_max <- df_seq %>% group_by(buoyid) %>% summarise(max_temp = max(temperature, na.rm = TRUE))
# Print the results.
df_max %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid max_temp
#> <int> <int>
#> 1 0 100
#> 2 1 79
#> 3 2 82
#> 4 22 23
#> 5 44 56
# Exclude duplicates in the same aggregate operation above.
df_max <- df_seq %>% group_by(buoyid) %>% summarise(max_temp = max(distinct(temperature)))
# Print the results.
df_max %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid max_temp
#> <int> <int>
#> 1 0 100
#> 2 1 79
#> 3 2 82
#> 4 22 23
#> 5 44 56
# Calculate the maximum value in the 'temperature' column without any grouping.
df_max <- df_seq %>% summarise(max_temp = max(temperature))
# Print the results.
df_max
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> max_temp
#> <int>
#> 1 100
# Exclude duplicates in the same aggregate operation above.
df_max <- df_seq %>% summarise(max_temp = max(distinct(temperature)))
# Print the results.
df_max
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> max_temp
#> <int>
#> 1 100
The function max
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'max_t' which holds the maximum value in the 'temperature' column.
df_max <- df_seq %>% mutate(max_t = max(temperature))
# Print the query.
df_max %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", MAX("temperature") OVER () AS "max_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_max %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates max_t
#> <dttm> <int> <int> <int> <int> <date> <int>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 100
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 100
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 100
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 100
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 100
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 100
#> # ... with more rows
# Filter the rows with temperature values less than or equal to one-tenth of the maximum value in 'temperature' column.
df_max <- df_seq %>% filter(temperature <= max(temperature)/10)
# Print the results.
df_max
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> 2 2014-01-06 08:00:00 26 0 55 10 2016-02-26
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
max
, but not time series aggregatets.max
, for regular aggregate operations and window aggregate operations. - The use of
distinct
is not permitted in window aggregate functions.
Sample Variance : var()
The aggregate function var
returns the sample variance of values of the column. The variance of a sample is a measure of dispersion from the mean of that sample. It is the square of the sample standard deviation. For more information on var
aggregate function, please refer to Teradata SQL translation vignette. Use var(distinct(column_name))
to exclude duplicate rows while calculating sample variance.
# Calculate the sample variance of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_var <- df_seq %>% group_by(buoyid) %>% summarise(var_temp = var(temperature, na.rm = TRUE))
# Print the results.
df_var %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid var_temp
#> <int> <dbl>
#> 1 0 2670.
#> 2 1 15.5
#> 3 2 1
#> 4 22 NA
#> 5 44 33.2
# Exclude duplicates in the same aggregate operation above.
df_var <- df_seq %>% group_by(buoyid) %>% summarise(var_temp = var(distinct(temperature)))
# Print the results.
df_var %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid var_temp
#> <int> <dbl>
#> 1 0 2670.
#> 2 1 15.5
#> 3 2 1
#> 4 22 NA
#> 5 44 27.7
# Calculate the sample variance of the values in the 'temperature' column without any grouping.
df_var <- df_seq %>% summarise(var_temp = var(temperature))
# Print the results.
df_var
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> var_temp
#> <dbl>
#> 1 548.
# Exclude duplicates in the same aggregate operation above.
df_var <- df_seq %>% summarise(var_temp = var(distinct(temperature)))
# Print the results.
df_var
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> var_temp
#> <dbl>
#> 1 558.
The function var
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'var_t' which holds the sample variance of the values in the 'temperature' column.
df_var <- df_seq %>% mutate(var_t = var(temperature))
# Print the query.
df_var %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_SAMP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", VAR_SAMP("temperature") OVER () AS "var_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_var %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates var_t
#> <dttm> <int> <int> <int> <int> <date> <dbl>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 548.
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 548.
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 548.
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 548.
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 548.
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 548.
#> # ... with more rows
# Filter the rows with temperature values less than one-tenth of the sample variance of the values in the 'temperature' column.
df_var <- df_seq %>% filter(temperature < var(temperature)/10)
# Print the results.
df_var
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 10:00:24 4 44 55 43 2014-04-04
#> 2 2014-01-06 10:00:25 6 44 55 43 2014-06-06
#> 3 2014-01-06 10:00:26 7 44 55 43 2014-07-07
#> 4 2014-01-06 10:01:25 8 44 55 53 2014-08-08
#> 5 2014-01-06 10:00:24 5 44 55 43 2014-05-05
#> 6 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
var
for regular aggregate and window operations, instead of time series aggregate functionts.var
. - When there are fewer than two non-NULL data points in the sample used for the computation,
var
returns NULL/NA. - Division by zero results in NULL/NA value rather than an error.
- If data represents only a sample of the entire population for the column, Teradata recommends to use
var
to calculate sample variance instead ofvarp
which calculates population variance. As the sample size increases, the values forvar
andvarp
approach the same number. - The use of
distinct
is not permitted in window aggregate functions.
Population Variance : varp()
The aggregate function varp
returns the population variance of values of the column. The variance of a population is a measure of dispersion from the mean of that population. For more information on varp
aggregate function, please refer to Teradata SQL translation vignette. Use varp(distinct(column_name))
to exclude duplicate rows while calculating population variance.
# Calculate the population variance of the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_varp <- df_seq %>% group_by(buoyid) %>% summarise(varp_temp = varp(temperature, na.rm = TRUE))
# Print the results.
df_varp %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid varp_temp
#> <int> <dbl>
#> 1 0 2003.
#> 2 1 12.9
#> 3 2 0.667
#> 4 22 0
#> 5 44 30.7
# Exclude duplicates in the same aggregate operation above.
df_varp <- df_seq %>% group_by(buoyid) %>% summarise(varp_temp = varp(distinct(temperature)))
# Print the results.
df_varp %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid varp_temp
#> <int> <dbl>
#> 1 0 1780.
#> 2 1 12.9
#> 3 2 0.667
#> 4 22 0
#> 5 44 22.2
# Calculate the population variance of the values in the 'temperature' column without any grouping.
df_varp <- df_seq %>% summarise(varp_temp = varp(temperature))
# Print the results.
df_varp
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> varp_temp
#> <dbl>
#> 1 528.
# Exclude duplicates in the same aggregate operation above.
df_varp <- df_seq %>% summarise(varp_temp = varp(distinct(temperature)))
# Print the results.
df_varp
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> varp_temp
#> <dbl>
#> 1 527.
The function varp
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'var_t' which holds the population variance of the values in the 'temperature' column.
df_varp <- df_seq %>% mutate(var_t = varp(temperature))
# Print the query.
df_varp %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", VAR_POP("temperature") OVER () AS "var_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_varp %>% arrange(TD_TIMECODE, TD_SEQNO)
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates var_t
#> <dttm> <int> <int> <int> <int> <date> <dbl>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 528.
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 528.
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 528.
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 528.
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 528.
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 528.
#> # ... with more rows
# Filter the rows with temperature values less than one-tenth of the population variance of the values in the 'temperature' column.
df_varp <- df_seq %>% filter(temperature < varp(temperature)/10)
# Print the results.
df_varp
#> Warning: Missing values are always removed in SQL.
#> Use `VAR_POP(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 10:00:24 4 44 55 43 2014-04-04
#> 2 2014-01-06 10:00:25 6 44 55 43 2014-06-06
#> 3 2014-01-06 10:00:26 7 44 55 43 2014-07-07
#> 4 2014-01-06 10:12:00 3 44 55 43 2014-03-03
#> 5 2014-01-06 10:00:24 5 44 55 43 2014-05-05
#> 6 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
varp
for regular aggregate and window operations, instead of time series aggregate functionts.varp
. - When there are fewer than two non-NULL data points in the population used for the computation,
varp
returns NULL/NA. - Division by zero results in NULL/NA value rather than an error.
varp
can only be used if data represents entire population. Otherwise, Teradata recommends to usevar
to calculate sample variance.- The use of
distinct
is not permitted in window aggregate functions.
Average : mean()
The aggregate function mean
returns the average value in the column. For more information on mean
aggregate function, please refer to Teradata SQL translation vignette. Use mean(distinct(column_name))
to exclude duplicate rows while calculating average.
# Calculate the average value in the 'temperature' column grouped by 'buoyid' column.
# Please note that the output of the function remains the same (excludes NULL/NA values in computation) irrespective of the argument 'na.rm' set to TRUE. Warning 'Missing values are always removed in SQL.' will be suppressed if this is set to TRUE, otherwise the warning is raised.
df_avg <- df_seq %>% group_by(buoyid) %>% summarise(avg_temp = mean(temperature, na.rm = TRUE))
# Print the results.
df_avg %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid avg_temp
#> <int> <dbl>
#> 1 0 54.8
#> 2 1 74.5
#> 3 2 81
#> 4 22 23
#> 5 44 48.1
# Exclude duplicates in the same aggregate operation above.
df_avg <- df_seq %>% group_by(buoyid) %>% summarise(avg_temp = mean(distinct(temperature)))
# Print the results.
df_avg %>% arrange(buoyid)
#> Warning: Missing values are always removed in SQL.
#> Use `mean(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid avg_temp
#> <int> <dbl>
#> 1 0 69.7
#> 2 1 74.5
#> 3 2 81
#> 4 22 23
#> 5 44 52.2
# Calculate the average value in the 'temperature' column without any grouping.
df_avg <- df_seq %>% summarise(avg_temp = mean(temperature))
# Print the results.
df_avg
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> avg_temp
#> <dbl>
#> 1 57.7
# Exclude duplicates in the same aggregate operation above.
df_avg <- df_seq %>% summarise(avg_temp = mean(distinct(temperature)))
# Print the results.
df_avg
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> avg_temp
#> <dbl>
#> 1 65.7
The function mean
can be used as window aggregates using dplyr verbs like mutate
, filter
.
# Create a new column 'avg_t' which holds the average value in the 'temperature' column.
df_avg <- df_seq %>% mutate(avg_t = mean(temperature))
# Print the query.
df_avg %>% show_query()
#> <SQL>
#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", AVG("temperature") OVER () AS "avg_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_avg %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates avg_t
#> <dttm> <int> <int> <int> <int> <date> <dbl>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26 57.7
#> 2 2014-01-06 08:08:59 18 0 55 NA 2015-06-18 57.7
#> 3 2014-01-06 08:09:59 17 0 55 99 2015-05-17 57.7
#> 4 2014-01-06 08:10:00 19 0 55 10 2015-07-19 57.7
#> 5 2014-01-06 08:10:00 27 0 55 100 2016-03-27 57.7
#> 6 2014-01-06 09:01:25 11 1 55 70 2014-11-11 57.7
#> # ... with more rows
# Filter the rows with temperature values less than one-fifth of the average value in 'temperature' column.
df_avg <- df_seq %>% filter(temperature < mean(temperature)/5)
# Print the results.
df_avg
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> 2 2014-01-06 08:00:00 26 0 55 10 2016-02-26
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
mean
, but not time series aggregatets.mean
, for regular aggregate operations and window aggregate operations. - The use of
distinct
is not permitted in window aggregate functions.
Count : n() and n_distinct()
The aggregate function n
returns the total number of qualified rows in the column. n
uses one argument named x
, which specifies the column for which total number of rows is to be computed. For more information on n
aggregate function, please refer to Teradata SQL translation vignette. Use n(distinct(column_name))
to exclude duplicate rows while calculating the number of rows. Another way of accessing distinct rows is n_distinct(column_name)
.
# Calculate the number of rows in the 'temperature' column grouped by 'buoyid' column.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n(temperature))
# Print the results.
df_count %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid count_temp
#> <int> <int64>
#> 1 0 4
#> 2 1 6
#> 3 2 3
#> 4 22 1
#> 5 44 13
# Exclude duplicates in the same aggregate operation above.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n(distinct(temperature)))
# Print the results.
df_count %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid count_temp
#> <int> <int64>
#> 1 0 3
#> 2 1 6
#> 3 2 3
#> 4 22 1
#> 5 44 5
# Another way of excluding duplicates in the same aggregate operation above.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n_distinct(temperature))
# Print the results.
df_count %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid count_temp
#> <int> <int64>
#> 1 0 3
#> 2 1 6
#> 3 2 3
#> 4 22 1
#> 5 44 5
# Calculate the number of rows in the table grouped by 'buoyid' column.
# Note that the argument is not provided for the function `n()`.
df_count <- df_seq %>% group_by(buoyid) %>% summarise(count_temp = n())
# Print the results.
df_count %>% arrange(buoyid)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: buoyid
#> buoyid count_temp
#> <int> <int64>
#> 1 0 5
#> 2 1 6
#> 3 2 3
#> 4 22 1
#> 5 44 13
# Calculate the number of rows in the 'temperature' column without any grouping.
df_count <- df_seq %>% summarise(count_temp = n(temperature))
# Print the results.
df_count
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> count_temp
#> <int64>
#> 1 27
# Exclude duplicates in the same aggregate operation above.
df_count <- df_seq %>% summarise(count_temp = n(distinct(temperature)))
# Print the results.
df_count
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> count_temp
#> <int64>
#> 1 18
# Calculate the number of rows in the table without any grouping.
# Note that the argument is not provided for the function `n()`.
df_count <- df_seq %>% summarise(count_temp = n())
# Print the results.
df_count
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> count_temp
#> <int64>
#> 1 28
The function n
can be used as window aggregates using dplyr verbs like mutate
, filter
. However, the n_distinct
aggregate function cannot be used with
# Create a new column 'count_t' which holds the number of rows in the 'temperature' column.
df_count <- df_seq %>% mutate(count_t = n(temperature))
# Print the query.
df_count %>% show_query()
#> <SQL>
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", CAST(COUNT("temperature") OVER () AS BIGINT) AS "count_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_count %>% select(TD_TIMECODE, TD_SEQNO, buoyid, count_t) %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid count_t
#> <dttm> <int> <int> <int64>
#> 1 2014-01-06 08:00:00 26 0 27
#> 2 2014-01-06 08:08:59 18 0 27
#> 3 2014-01-06 08:09:59 17 0 27
#> 4 2014-01-06 08:10:00 19 0 27
#> 5 2014-01-06 08:10:00 27 0 27
#> 6 2014-01-06 09:01:25 11 1 27
#> # ... with more rows
# Create a new column 'count_t' which holds the number of rows in the table.
# Note that the argument is not provided for the function `n()`.
df_count <- df_seq %>% mutate(count_t = n())
# Print the query.
df_count %>% show_query()
#> <SQL>
#> SELECT "TD_TIMECODE", "TD_SEQNO", "buoyid", "salinity", "temperature", "dates", CAST(COUNT(*) OVER () AS BIGINT) AS "count_t"
#> FROM "ocean_buoys_seq"
# Print the results.
df_count %>% select(TD_TIMECODE, TD_SEQNO, buoyid, count_t) %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid count_t
#> <dttm> <int> <int> <int64>
#> 1 2014-01-06 08:00:00 26 0 28
#> 2 2014-01-06 08:08:59 18 0 28
#> 3 2014-01-06 08:09:59 17 0 28
#> 4 2014-01-06 08:10:00 19 0 28
#> 5 2014-01-06 08:10:00 27 0 28
#> 6 2014-01-06 09:01:25 11 1 28
#> # ... with more rows
# Filter the rows with temperature values less than the number of rows in 'temperature' column.
df_count <- df_seq %>% filter(temperature < n(temperature))
# Print the results.
df_count %>% arrange(TD_TIMECODE, TD_SEQNO)
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> # Ordered by: TD_TIMECODE, TD_SEQNO
#> TD_TIMECODE TD_SEQNO buoyid salinity temperature dates
#> <dttm> <int> <int> <int> <int> <date>
#> 1 2014-01-06 08:00:00 26 0 55 10 2016-02-26
#> 2 2014-01-06 08:10:00 19 0 55 10 2015-07-19
#> 3 2014-01-06 10:32:12 1 22 25 23 2014-01-01
Notes
- NULLs/NAs are not included in the result computation.
n
andn_distinct
are valid on all columns with any data type.- Teradata recommends to strictly use
n
, but not time series aggregatets.n
, for regular aggregate operations and window aggregate operations. - The use of
distinct
is not permitted in window aggregate functions i.e.,n(distinct(column_name))
andn_distinct(column_name)
cannot be used as window aggregate operations.