- Introduction to Time Series Aggregates
- group_by_time API
- Aggregate Functions
- Mode : ts.mode()
- Bottom : ts.bottom()
- Top : ts.top()
- Median : ts.median()
- First : ts.first()
- Last : ts.last()
- Median Absolute Deviation : ts.mad()
- Percentile : ts.percentile()
- Delta_t : ts.delta_t()
- Kurtosis : ts.kurtosis()
- Skew : ts.skew()
- Sum : ts.sum()
- Sample Standard Deviation : ts.sd()
- Population Standard Deviation : ts.sdp()
- Minimum : ts.min()
- Maximum : ts.max()
- Sample Variance : ts.var()
- Population Variance : ts.varp()
- Average : ts.mean()
- Count : ts.n()
- Describe : ts.describe()
Introduction to Time Series Aggregates
The Time Series(TS) aggregate functions help perform aggregate operations on time series data. The goal of this vignette is to provide introductory examples to get the user familiar with the group_by_time API and other TS aggregate functions.
Each time series aggregate function has a section in this vignette. Each section can be read in isolation, except where references are made to other sections or other vignettes. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using the aggregate functions with Teradata Vantage.
Once the connection is established using td_create_context, you can load the example tables ocean_buoys_seq
, ocean_buoys_nonseq
, ocean_buoys_nonpti
, package_tracking_pti
and package_tracking_nonpti
and create respective tibbles using the following statements:
loadExampleData("time_series_example", "ocean_buoys_seq", "ocean_buoys_nonseq", "ocean_buoys_nonpti", "package_tracking_pti", "package_tracking_nonpti")
# Create tbl_teradata objects.
df_seq <- tbl(con, "ocean_buoys_seq")
df_nonseq <- tbl(con, "ocean_buoys_nonseq")
df_nonpti <- tbl(con, "ocean_buoys_nonpti")
df_pack_pti <- tbl(con, "package_tracking_pti")
df_pack_nonpti <- tbl(con, "package_tracking_nonpti")
group_by_time API
Every time series aggregation must be preceded with grouping by time using the API group_by_time
. For more information on the API, one can use help(group_by_time)
or ?group_by_time
in R console. Here, we provide different usages of group_by_time
.
# Grouping the sequenced PTI tables based on time with timebucket duration of 30 minutes and the column 'buoyid'. Note the use of shorthand notation for timebucket duration.
df_seq_grp <- df_seq %>%
group_by_time(timebucket.duration = "30m", value.expression = "buoyid")
# Grouping the non-PTI tables based on time with timebucket duration of 1 minute and filling the missing timebuckets with previous values. Note the use of formal notation for timebucket duration and timecode.column argument (mandatory for non-PTI table).
df_nonpti_grp <- df_nonpti %>%
group_by_time(timebucket.duration = "MINUTES(1)", timecode.column = "TIMECODE", fill = "PREV")
# Grouping the non-PTI tables based on time with timebucket duration of 1 Calendar Year with fill = 10000 (some numeric constant)
df_nonseq_grp <- df_nonseq %>%
group_by_time(timebucket.duration = "CAL_YEARS(1)", fill = 10000)
Aggregate Functions
Mode : ts.mode()
The aggregate function ts.mode
returns the mode of all values in each group. In the event of a tie between two or more values, a row per result is returned. ts.mode
uses the argument value.expression
to specify the column for which mode is to be computed.
# Calculate the mode of the 'temperature' column of sequenced PTI table.
df_seq_mode <- df_seq_grp %>% summarise(mode_temp = ts.mode(temperature))
# Print the results.
df_seq_mode %>% arrange(TIMECODE_RANGE, buoyid, mode_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, mode_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid mode_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-~ 35345 0 10
#> 2 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 70
#> 3 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 71
#> 4 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 72
#> 5 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 77
#> 6 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 78
#> # ... with more rows
# Calculate the mode of the 'temperature' column of non-sequenced PTI table.
df_nonseq_mode <- df_nonseq_grp %>% summarise(mode_temp = ts.mode(temperature))
# Print the results.
df_nonseq_mode %>% arrange(TIMECODE_RANGE, mode_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, mode_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ mode_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01~ 3 43
# Calculate the mode of the 'temperature' column of non-PTI table.
df_nonpti_mode <- df_nonpti_grp %>% summarise(mode_temp = ts.mode(temperature))
# Print the results.
df_nonpti_mode %>% arrange(TIMECODE_RANGE, mode_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, mode_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ mode_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 23149921 10
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-06 ~ 23149922 10
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-06 ~ 23149923 10
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-06 ~ 23149924 10
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-06 ~ 23149925 10
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-06 ~ 23149926 10
#> # ... with more rows
Notes
- The data type of the column containing mode values is same as that of the column for which mode is to be computed.
- NULLs/NAs are not included in the result computation.
ts.mode
is valid only for numeric data.ts.mode
must be used withgroup_by_time()
.- Use of
distinct
is not allowed forts.mode
function.
Bottom : ts.bottom()
The aggregate function ts.bottom
returns the smallest number.of.values in the value.expression for each group, with or without ties. ts.bottom
uses three arguments:
value.expression
which specifies the column from which bottom values are to be returned.number.of.values
which specifies the integer value representing the number of values to return.with.ties
which specifies a flag to decide whether to run bottom function with ties or not. Default value is FALSE.
# Get the smallest 2 values of the 'temperature' column for each group without ties of sequenced PTI table.
df_seq_bottom <- df_seq_grp %>% summarise(bottom_temp = ts.bottom(temperature, 2))
# Print the results.
df_seq_bottom %>% arrange(TIMECODE_RANGE, buoyid, bottom_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, bottom_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid bottom_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,201~ 35345 0 10
#> 2 2014-01-06 08:00:00.000000+00:00,201~ 35345 0 10
#> 3 2014-01-06 09:00:00.000000+00:00,201~ 35347 1 70
#> 4 2014-01-06 09:00:00.000000+00:00,201~ 35347 1 71
#> 5 2014-01-06 10:00:00.000000+00:00,201~ 35349 44 43
#> 6 2014-01-06 10:00:00.000000+00:00,201~ 35349 44 43
#> # ... with more rows
# Get the smallest 4 values of the 'temperature' column for each group with ties of non-sequenced PTI table.
df_nonseq_bottom <- df_nonseq_grp %>% summarise(bottom_temp = ts.bottom(temperature, 4, TRUE))
# Print the results.
df_nonseq_bottom %>% arrange(TIMECODE_RANGE, bottom_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, bottom_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ bottom_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 10
#> 2 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 10
#> 3 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 43
#> 4 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 43
#> 5 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 43
#> 6 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 43
#> # ... with more rows
# Get the smallest 2 values of the 'temperature' column for each group with ties of non-PTI table.
df_nonpti_bottom <- df_nonpti_grp %>% summarise(bottom_temp = ts.bottom(temperature, 2, TRUE))
# Print the results.
df_nonpti_bottom %>% arrange(TIMECODE_RANGE, bottom_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, bottom_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ bottom_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-0~ 23149921 10
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-0~ 23149922 10
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-0~ 23149923 10
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-0~ 23149924 10
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-0~ 23149925 10
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-0~ 23149926 10
#> # ... with more rows
Notes
- The data type of the column containing bottom values is same as that of the column for which bottom is to be computed.
- NULLs/NAs are not included in the result computation.
ts.bottom
is valid only for numeric data.ts.bottom
must be used withgroup_by_time()
.ts.bottom
with ties implies that the rows returned include the specified number of rows in the ordered set for each timebucket. It includes any rows where the sort key value is the same as the sort key value in the last row that satisfies the specified number or percentage of rows. If this clause is omitted and ties are found, the earliest value in terms of timecode is returned.- Use of
distinct
is not allowed forts.bottom
function.
Top : ts.top()
The aggregate function ts.top
returns the largest number.of.values in the value.expression for each group, with or without ties. ts.top
uses three arguments:
value.expression
which specifies the column from which top values are to be returned.number.of.values
which specifies the integer value representing the number of values to return.with.ties
which specifies a flag to decide whether to run top function with ties or not. Default value is FALSE.
# Get the largest 2 values of the 'temperature' column for each group without ties of sequenced PTI table.
df_seq_top <- df_seq_grp %>% summarise(top_temp = ts.top(temperature, 2))
# Print the results.
df_seq_top %>% arrange(TIMECODE_RANGE, buoyid, top_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, top_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid top_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 99
#> 2 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 100
#> 3 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 78
#> 4 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 79
#> 5 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 55
#> 6 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 56
#> # ... with more rows
# Get the largest 4 values of the 'temperature' column for each group with ties of non-sequenced PTI table.
df_nonseq_top <- df_nonseq_grp %>% summarise(top_temp = ts.top(temperature, 4, TRUE))
# Print the results.
df_nonseq_top %>% arrange(TIMECODE_RANGE, top_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, top_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ top_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 81
#> 2 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 82
#> 3 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 99
#> 4 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 100
# Get the largest 2 values of the 'temperature' column for each group with ties of non-PTI table.
df_nonpti_top <- df_nonpti_grp %>% summarise(top_temp = ts.top(temperature, 2, TRUE))
# Print the results.
df_nonpti_top %>% arrange(TIMECODE_RANGE, top_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, top_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ top_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 0~ 23149921 10
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-06 0~ 23149922 10
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-06 0~ 23149923 10
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-06 0~ 23149924 10
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-06 0~ 23149925 10
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-06 0~ 23149926 10
#> # ... with more rows
Notes
- The data type of the column containing top values is same as that of the column for which top is to be computed.
- NULLs/NAs are not included in the result computation.
ts.top
is valid only for numeric data.ts.top
must be used withgroup_by_time()
.ts.top
with ties implies that the rows returned include the specified number of rows in the ordered set for each timebucket. It includes any rows where the sort key value is the same as the sort key value in the last row that satisfies the specified number or percentage of rows. If this clause is omitted and ties are found, the earliest value in terms of timecode is returned.- Use of
distinct
is not allowed forts.top
function.
Median : ts.median()
The aggregate function ts.median
returns the median of all values in each group. The function returns the average of the two middle values if the argument value.expression
contains an even number of values. ts.median
uses two arguments:
value.expression
specifies the column for which median is to be computed.use.distinct
specifies whether to exclude duplicates specified by value.expression from the computation. Default value is FALSE. By default, all non-null values in the column specified in value.expression, including duplicates, are considered in the computation. Ifuse.distinct = TRUE
, then all duplicate values are excluded in the computation.
# Calculate the median of the 'temperature' column of sequenced PTI table.
df_seq_median <- df_seq_grp %>% summarise(median_temp = ts.median(temperature))
# Print the results.
df_seq_median %>% arrange(TIMECODE_RANGE, buoyid, median_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, median_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid median_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,201~ 35345 0 54.5
#> 2 2014-01-06 09:00:00.000000+00:00,201~ 35347 1 74.5
#> 3 2014-01-06 10:00:00.000000+00:00,201~ 35349 44 53
#> 4 2014-01-06 10:30:00.000000+00:00,201~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,201~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,201~ 35371 2 81
# Calculate the median of the 'temperature' column of non-sequenced PTI table, excluding the duplicates in the computation.
df_nonseq_median <- df_nonseq_grp %>% summarise(median_temp = ts.median(temperature, TRUE))
# Another way of excluding duplicates for median.
df_nonseq_median <- df_nonseq_grp %>% summarise(median_temp = ts.median(distinct(temperature)))
# Print the results.
df_nonseq_median %>% arrange(TIMECODE_RANGE, median_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, median_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ median_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 72
# Calculate the median of the 'temperature' column of non-PTI table.
df_nonpti_median <- df_nonpti_grp %>% summarise(median_temp = ts.median(temperature))
# Print the results.
df_nonpti_median %>% arrange(TIMECODE_RANGE, median_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, median_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ median_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-0~ 23149921 10
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-0~ 23149922 10
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-0~ 23149923 10
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-0~ 23149924 10
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-0~ 23149925 10
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-0~ 23149926 10
#> # ... with more rows
Notes
- The column containing median values is of NUMERIC data type.
- NULLs/NAs are not included in the result computation.
ts.median
is valid only for numeric data.- The use of
use.distinct = TRUE
is valid only when the functionts.median
is used withgroup_by_time
. The functionts.median
can be used as regular aggregate operation only when the argumentuse.distinct
takesFALSE
. Otherwise, an error is thrown. However, Teradata recommends to strictly usets.median
function only for time series aggregate operations. - The function
ts.median
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.median() with mutate and filter
df_median <- df_seq %>% mutate(median_t = ts.median(temperature))
df_median <- df_seq %>% filter(temperature <= ts.median(temperature))
First : ts.first()
The aggregate function ts.first
returns the oldest value, determined by timecode, for each group. In the event of a tie, such as simultaneous timecode values for a particular group, all tied results are returned. If a sequence number is present with the data, it can break a tie, assuming it is unique across identical timecode values. ts.first
uses the argument value.expression
to specify the column for which oldest value of the group is to be returned.
# Get the oldest value of the 'temperature' column of sequenced PTI table.
df_seq_first <- df_seq_grp %>% summarise(first_temp = ts.first(temperature))
# Print the results.
df_seq_first %>% arrange(TIMECODE_RANGE, buoyid, first_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, first_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid first_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014~ 35345 0 10
#> 2 2014-01-06 09:00:00.000000+00:00,2014~ 35347 1 70
#> 3 2014-01-06 10:00:00.000000+00:00,2014~ 35349 44 43
#> 4 2014-01-06 10:30:00.000000+00:00,2014~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014~ 35371 2 80
# Get the oldest value of the 'temperature' column of non-sequenced PTI table.
df_nonseq_first <- df_nonseq_grp %>% summarise(first_temp = ts.first(temperature))
# Print the results.
df_nonseq_first %>% arrange(TIMECODE_RANGE, first_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, first_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ first_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-0~ 3 10
# Get the oldest value of the 'temperature' column of non-PTI table.
df_nonpti_first <- df_nonpti_grp %>% summarise(first_temp = ts.first(temperature))
# Print the results.
df_nonpti_first %>% arrange(TIMECODE_RANGE, first_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, first_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ first_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06~ 23149921 10
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-06~ 23149922 10
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-06~ 23149923 10
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-06~ 23149924 10
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-06~ 23149925 10
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-06~ 23149926 10
#> # ... with more rows
Notes
- The data type of the column containing oldest values is same as that of the column for which oldest value of the group is to be returned.
- NULLs/NAs are not included in the result computation.
ts.first
is valid only for numeric data.ts.first
must be used withgroup_by_time()
.- Use of
distinct
is not allowed forts.first
function.
Last : ts.last()
The aggregate function ts.last
returns the newest value, determined by timecode, for each group. In the event of a tie, such as simultaneous timecode values for a particular group, all tied results are returned. If a sequence number is present with the data, it can break a tie, assuming it is unique across identical timecode values. ts.last
uses the argument value.expression
to specify the column for which newest value of the group is to be returned.
# Get the newest value of the 'temperature' column of sequenced PTI table.
df_seq_last <- df_seq_grp %>% summarise(last_temp = ts.last(temperature))
# Print the results.
df_seq_last %>% arrange(TIMECODE_RANGE, buoyid, last_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, last_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid last_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-~ 35345 0 100
#> 2 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 79
#> 3 2014-01-06 10:00:00.000000+00:00,2014-~ 35349 44 43
#> 4 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-~ 35371 2 82
# Get the newest value of the 'temperature' column of non-sequenced PTI table.
df_nonseq_last <- df_nonseq_grp %>% summarise(last_temp = ts.last(temperature))
# Print the results.
df_nonseq_last %>% arrange(TIMECODE_RANGE, last_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, last_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ last_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01~ 3 82
# Get the newest value of the 'temperature' column of non-PTI table.
df_nonpti_last <- df_nonpti_grp %>% summarise(last_temp = ts.last(temperature))
# Print the results.
df_nonpti_last %>% arrange(TIMECODE_RANGE, last_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, last_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ last_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 23149921 10
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-06 ~ 23149922 10
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-06 ~ 23149923 10
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-06 ~ 23149924 10
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-06 ~ 23149925 10
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-06 ~ 23149926 10
#> # ... with more rows
Notes
- The data type of the column containing newest values is same as that of the column for which newest value of the group is to be returned.
- NULLs/NAs are not included in the result computation.
ts.last
is valid only for numeric data.- Use of
distinct
is not allowed forts.last
function. ts.last
must be used withgroup_by_time()
. Whents.last()
is used withgroup_by
but notgroup_by_time
,ts.last()
throws different exceptions because LAST is also a function which can operate on Period data types, unlikets.first()
which throws an exception “Time Series Aggregate function invoked without an appropriate GROUP BY TIME clause”. The following are such invalid use cases ofts.last()
function:
# Invalid example 1: Using ts.last() on `group_by` with aggregate column not as grouping column.
df1 <- df_seq %>% group_by(buoyid) %>% summarise(temp = ts.last(temperature))
# Note that this will not print the result but throws an exception - "Selected non-aggregate values must be part of the associated group.". This is because the aggregate operation "LAST" (considered here) is the Period data type function.
df1
# Invalid example 2: Using ts.last() on `group_by` with aggregate column as one of the grouping columns.
df2 <- df %>% group_by(buoyid, temperature) %>% summarise(temp = ts.last(temperature))
# Note that this will not print the result but throws an exception - "Invalid argument for the LAST function. The argument must have a Period data type.". This is because the aggregate operation "LAST" (considered here) is the Period data type function.
df2
Median Absolute Deviation : ts.mad()
The aggregate function ts.mad
returns the median of the set of values defined as the absolute value of the difference between each value and the median of all values in each group. ts.mad
uses two arguments:
value.expression
which specifies the column from which top values are to be returned.multiplier
which specifies a literal numeric constant. If this argument is NULL, default multipler 1.4826 is used for MAD computation.
Formula for computing MAD is as follows:
MAD = b * Mi(|Xi - Mj(Xj)|)
Where,
b = A numeric constant. Default value is 1.4826.
Mj(Xj) = Median of the original set of values.
Xi = The original set of values.
Mi = Median of absolute value of the difference between each value in Xi and the Median calculated in Mj(Xj).
# Calculate the MAD value of the 'temperature' column of sequenced PTI table.
df_seq_mad <- df_seq_grp %>% summarise(mad_temp = ts.mad(temperature))
# Print the results.
df_seq_mad %>% arrange(TIMECODE_RANGE, buoyid, mad_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, mad_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid mad_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 66.0
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 5.19
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 4.45
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 0
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 0
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 1.48
# Calculate the MAD value of the 'temperature' column of non-sequenced PTI table.
df_nonseq_mad <- df_nonseq_grp %>% summarise(mad_temp = ts.mad(temperature, 2))
# Print the results.
df_nonseq_mad %>% arrange(TIMECODE_RANGE, mad_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, mad_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ mad_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 27
# Calculate the MAD value of the 'temperature' column of non-PTI table.
df_nonpti_mad <- df_nonpti_grp %>% summarise(mad_temp = ts.mad(temperature, 5))
# Print the results.
df_nonpti_mad %>% arrange(TIMECODE_RANGE, mad_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, mad_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ mad_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 0~ 23149921 0
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-06 0~ 23149922 0
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-06 0~ 23149923 0
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-06 0~ 23149924 0
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-06 0~ 23149925 0
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-06 0~ 23149926 0
#> # ... with more rows
Notes
- The column containing MAD values is of NUMERIC data type.
- NULLs/NAs are not included in the result computation.
ts.mad
is valid only for numeric data.ts.mad
must be used withgroup_by_time()
.- Use of
distinct
is not allowed forts.mad
function.
Percentile : ts.percentile()
The aggregate function ts.percentile
returns the value which represents the desired percentile from each group. ts.percentile
uses three arguments:
value.expression
specifies the column for which percentile value is to be computed.percentile
specifies the float value that represents the desired percentile.interpolation.type
specifies the type of interpolation to interpolate the result value when the desired result lies between two data points. Permitted values are LINEAR, LOW, HIGH, NEAREST and MIDPOINT.
The result value is determined by the desired index (di
) in an ordered list of values. The following equation is for the di
:
di = (number of values in group - 1) * percentile/100
When di
is a whole number, that value is the returned result. The di
can also be between two data points, i
and j
, where i < j
. In this case, the result is interpolated according to one of the following schemes:
- Linear interpolation: When the argument
interpolation.type
is set to LINEAR, the result value is computed using the equationresult = i + (j - i) * (di/100) MOD 1
. - Low value interpolation: When the argument
interpolation.type
is set to LOW, the result value is equal toi
. - High value interpolation: When the argument
interpolation.type
is set to HIGH, the result value is equal toj
. - Nearest value interpolation: When the argument
interpolation.type
is set to NEAREST, the result value isi
if(di / 100) MOD 1 <= 0.5
; Otherwise, it isj
. - Midpoint value interpolation: When the argument
interpolation.type
is set to MIDPOINT, the result value is equal to(i + j) / 2
.
Use percentile(distinct(column_name), ...)
to exclude duplicate rows while calculating percentile values.
# Calculate the 25th percentile of the 'temperature' column of sequenced PTI table. When the desired result value lies between two data points, linear interpolation is used as it is the default interpolation scheme.
df_seq_percentile <- df_seq_grp %>% summarise(percent_temp = ts.percentile(temperature, 25))
# Print the results.
df_seq_percentile %>% arrange(TIMECODE_RANGE, buoyid, percent_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, percent_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid percent_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,20~ 35345 0 10
#> 2 2014-01-06 09:00:00.000000+00:00,20~ 35347 1 71.2
#> 3 2014-01-06 10:00:00.000000+00:00,20~ 35349 44 43
#> 4 2014-01-06 10:30:00.000000+00:00,20~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,20~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,20~ 35371 2 80.5
# Calculate the 50th percentile of the 'temperature' column of non-sequenced PTI table. When the desired result value lies between two data points, nearest value interpolation is used.
df_nonseq_percentile <- df_nonseq_grp %>% summarise(percent_temp = ts.percentile(temperature, 50, "NEAREST"))
# Print the results.
df_nonseq_percentile %>% arrange(TIMECODE_RANGE, percent_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, percent_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YE~ percent_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-~ 3 54
# Calculate the 75th percentile of the distinct values in the 'temperature' column of non-PTI table. When the desired result value lies between two data points, low value interpolation is used.
df_nonpti_percentile <- df_nonpti_grp %>% summarise(percent_temp = ts.percentile(distinct(temperature), 75, "LOW"))
# Print the results.
df_nonpti_percentile %>% arrange(TIMECODE_RANGE, percent_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, percent_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ percent_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-~ 23149921 10
#> 2 2014-01-06 08:01:00.000000+00:00,2014-01-~ 23149922 10
#> 3 2014-01-06 08:02:00.000000+00:00,2014-01-~ 23149923 10
#> 4 2014-01-06 08:03:00.000000+00:00,2014-01-~ 23149924 10
#> 5 2014-01-06 08:04:00.000000+00:00,2014-01-~ 23149925 10
#> 6 2014-01-06 08:05:00.000000+00:00,2014-01-~ 23149926 10
#> # ... with more rows
Notes
- The column containing percentile values is of NUMERIC data type.
- NULLs/NAs are not included in the result computation.
ts.percentile
is valid only for numeric data.ts.percentile
must be used withgroup_by_time()
.
Delta_t : ts.delta_t()
The aggregate function ts.delta_t
calculates the time difference, or DELTA_T, between a starting and an ending event. The calculation is performed against a time-ordered time series data set. ts.delta_t
uses two arguments:
start.condition
specifies any supported filtering condition that defines the start of the time period for which you are searching.end.condition
specifies any supported filtering condition that defines the end of the time period for which you are searching.
# Example 1: Measures the time between minimum and maximum observed temperatures every 30 minutes between 8:00 AM and 10:30 AM on each buoy of a nonsequenced PTI table.
# Filter the data and grab all rows between timestamp '2014-01-06 08:00:00' and '2014-01-06 10:30:00'.
df_filter <- df_nonseq %>% filter(TD_TIMECODE >= "2014-01-06 08:00:00" && TD_TIMECODE < "2014-01-06 10:30:00")
# Get the minimum and maximum temperature within time range of 30 minutes.
df_min_max_temp <- df_filter %>% group_by_time("30m", value.expression = "buoyid", timecode.column = "TD_TIMECODE") %>%
summarise(min_t = min(temperature, na.rm = TRUE), max_t = max(temperature, na.rm = TRUE))
# Join the tbl_teradata 'df_min_max_temp' with original tbl_teradata 'df_nonseq'.
df_join <- inner_join(df_nonseq, df_min_max_temp, by = "buoyid")
# Execute 'ts.delta_t' after grouping the joined tbl_teradata into time buckets of 1 day.
df_grp1 <- df_join %>% group_by_time(timebucket.duration = "DAYS(1)", value.expression = "buoyid", timecode.column = "TD_TIMECODE")
df_out <- df_grp1 %>% summarise(delta_val = ts.delta_t(temperature == min_t, temperature == max_t))
# Print the results.
df_out %>% arrange(TIMECODE_RANGE, buoyid)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid
#> TIMECODE_RANGE `GROUP BY TIME(DA~ buoyid delta_val
#> <chr> <int64> <int> <chr>
#> 1 2014-01-06 00:00:00.000000+~ 16077 0 2014-01-06 08:00:00.00~
#> 2 2014-01-06 00:00:00.000000+~ 16077 1 2014-01-06 09:01:25.12~
#> 3 2014-01-06 00:00:00.000000+~ 16077 44 2014-01-06 10:00:26.12~
# Using strings to arguments 'start.condition' and 'end.condition'. This will give the same output as that of 'df_out'.
df_out1 <- df_grp1 %>% summarise(delta_val = ts.delta_t(start.condition = "temperature = min_t", end.condition = "temperature = max_t"))
# Print the results.
df_out1 %>% arrange(TIMECODE_RANGE, buoyid)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid
#> TIMECODE_RANGE `GROUP BY TIME(DA~ buoyid delta_val
#> <chr> <int64> <int> <chr>
#> 1 2014-01-06 00:00:00.000000+~ 16077 0 2014-01-06 08:00:00.00~
#> 2 2014-01-06 00:00:00.000000+~ 16077 1 2014-01-06 09:01:25.12~
#> 3 2014-01-06 00:00:00.000000+~ 16077 44 2014-01-06 10:00:26.12~
# Example 2: Finding Time Elapsed between Shipping and Receiving an Item. Input data used for this example contains information about parcels sent by a delivery service.
# Case 1: Using tbl_teradata on PTI Table and showcasing usage of unbounded time in grouping.
# Execute group_by_time() using unbounded time for timebucket.duration.
df_grp <- df_pack_pti %>% group_by_time(timebucket.duration = "*", value.expression = "parcel_number")
# Execute 'ts.delta_t', with start and end conditions specified as strings.
df_out <- df_grp %>% summarise(delta_t = ts.delta_t("Status LIKE 'picked%up%customer'", "Status LIKE 'delivered%customer'"))
# Print the results.
df_out %>% arrange(TIMECODE_RANGE, parcel_number)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, parcel_number
#> TIMECODE_RANGE `GROUP BY TIME(~ parcel_number delta_t
#> <chr> <int64> <int> <chr>
#> 1 2012-01-01 00:00:00.00000~ 1 55 2016-10-15 08:00:00~
#> 2 2012-01-01 00:00:00.00000~ 1 75 2016-10-15 08:00:00~
# Case 2: Using tbl_teradata on Non-PTI Table and showcasing usage of unbounded time in grouping.
# Execute group_by_time() using unbounded time for timebucket.duration.
df_grp <- df_pack_nonpti %>% group_by_time(timebucket.duration = "*", value.expression = "parcel_number", timecode.column = "clock_time")
# Execute 'ts.delta_t', with start and end conditions specified as strings.
df_out <- df_grp %>% summarise(delta_t = ts.delta_t("Status LIKE 'picked%up%customer'", "Status LIKE 'delivered%customer'"))
# Print the results.
df_out %>% arrange(TIMECODE_RANGE, parcel_number)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, parcel_number
#> TIMECODE_RANGE `GROUP BY TIME(~ parcel_number delta_t
#> <chr> <int64> <int> <chr>
#> 1 1970-01-01 00:00:00.00000~ 1 55 2016-10-15 08:00:00~
#> 2 1970-01-01 00:00:00.00000~ 1 75 2016-10-15 08:00:00~
Notes
- This is the only Time Series Aggregate function that works with timebucket_duration as "*" in
group_by_time()
, i.e., unbounded time. - When using
group_by_time()
with unbounded time, the following rules apply to the system virtual columns:$TD_GROUP_BY_TIME
: Always has a value of 1, since there is only one timebucket.$TD_TIMECODE_RANGE
: Composed of the first and last timecode values read for the group.
- The
ts.delta_t
function cannot be combined with any other functions. - The data being evaluated in the filtering conditions (for example, the minimum and maximum temperature observations) must belong to the timecode value present in the same row of data. This is the expected behavior. However, this assumption can be violated when joining multiple tables together. It is possible to construct a query where the result of a join causes specific data points (for example, a temperature reading) to be present in a data row with a timecode that is not indicative of when that data point occurred. In such a scenario, it is highly likely that the results are not as expected, or are misleading. Vantage does not detect these types of queries, so one must make sure to preserve the correlation between data points and timecodes.
- Function returns a column of PERIOD(TIMESTAMP WITH TIME ZONE) type (SQL datatype) composed of the start and end timecode, i.e., timecode column used for aggregation of each start-end pair.
- One result is returned per complete start-end pair found within the GROUP BY TIME window. The start-end pair process is as follows:
- If the current source data meets the start condition, the current timecode is saved as the start time.
- If the current source data meets the end condition, and a saved start timecode already exists, the start timecode is saved with the end timecode encountered as a result pair.
- The processing algorithm implies that multiple results may be found in each group.
- If no start-end pair is encountered, no result row is returned.
- Any result of
ts.delta_t
which has a delta less than 1 microsecond (including a delta of 0, in the case of a result which comes from a single point in time) is automatically rounded to 1 microsecond. This is strictly enforced to match Period data type semantics in Vantage which dictate that a starting and ending bound of a Period type may not be equivalent. The smallest granularity supported in Vantage is the microsecond, so these results are rounded accordingly. ts.delta_t
must be used withgroup_by_time()
.
Kurtosis : ts.kurtosis()
The aggregate function ts.kurtosis
measures the tailedness of the probability distribution of a column in each group. Kurtosis is the fourth moment of the distribution of the standardized (z) values. It is a measure of the outlier (rare, extreme observation) character of the distribution as compared to the normal (Gaussian) distribution.
- The normal distribution has a kurtosis of 0.
- Positive kurtosis indicates that the distribution is more outlier-prone (deviation from the mean) than the normal distribution.
- Negative kurtosis indicates that the distribution is less outlier-prone (deviation from the mean) than the normal distribution.
ts.kurtosis
uses the argument value.expression
to specify the column for which kurtosis is to be computed. Use ts.kurtosis(distinct(column_name))
to exclude duplicate rows while calculating kurtosis values.
# Calculate the Kurtosis of the 'temperature' column of sequenced PTI table.
df_seq_kurtosis <- df_seq_grp %>% summarise(kurtosis_temp = ts.kurtosis(temperature))
# Print the results.
df_seq_kurtosis %>% arrange(TIMECODE_RANGE, buoyid, kurtosis_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, kurtosis_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINU~ buoyid kurtosis_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,20~ 35345 0 -6.00
#> 2 2014-01-06 09:00:00.000000+00:00,20~ 35347 1 -2.76
#> 3 2014-01-06 10:00:00.000000+00:00,20~ 35349 44 -2.31
#> 4 2014-01-06 10:30:00.000000+00:00,20~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,20~ 35350 44 NA
#> 6 2014-01-06 21:00:00.000000+00:00,20~ 35371 2 NA
# Exclude duplicates in the same aggregate operation above.
df_seq_kurtosis <- df_seq_grp %>% summarise(kurtosis_temp = ts.kurtosis(distinct(temperature)))
# Print the results.
df_seq_kurtosis %>% arrange(TIMECODE_RANGE, buoyid, kurtosis_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, kurtosis_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINU~ buoyid kurtosis_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,20~ 35345 0 NA
#> 2 2014-01-06 09:00:00.000000+00:00,20~ 35347 1 -2.76
#> 3 2014-01-06 10:00:00.000000+00:00,20~ 35349 44 4.13
#> 4 2014-01-06 10:30:00.000000+00:00,20~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,20~ 35350 44 NA
#> 6 2014-01-06 21:00:00.000000+00:00,20~ 35371 2 NA
# Calculate the Kurtosis of the 'temperature' column of non-sequenced PTI table.
df_nonseq_kurtosis <- df_nonseq_grp %>% summarise(kurtosis_temp = ts.kurtosis(temperature))
# Print the results.
df_nonseq_kurtosis %>% arrange(TIMECODE_RANGE, kurtosis_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, kurtosis_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YE~ kurtosis_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01~ 3 0.0892
# Calculate the Kurtosis of the 'temperature' column of non-PTI table.
df_nonpti_kurtosis <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(kurtosis_temp = ts.kurtosis(temperature))
# Print the results.
df_nonpti_kurtosis %>% arrange(TIMECODE_RANGE, kurtosis_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, kurtosis_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ kurtosis_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01~ 2314993 NA
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01~ 2314994 NA
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01~ 2314999 -2.76
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01~ 2315005 -2.18
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01~ 2315006 NA
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01~ 2315008 NA
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.kurtosis
for time series aggregate operations, instead of regular aggregate functionkurtosis
. - Following conditions will produce NULL result:
- Fewer than three non-NULL data points in the data used for the computation.
- Standard deviation for a column is equal to 0.
ts.kurtosis
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.kurtosis() with mutate and filter.
df_kurtosis <- df_seq %>% mutate(kurtosis_t = ts.kurtosis(temperature))
df_kurtosis <- df_seq %>% filter(temperature <= ts.kurtosis(temperature))
Skew : ts.skew()
The aggregate function ts.skew
measures the skewness of the distribution of a column. Skewness is the third moment of a distribution. It is a measure of the asymmetry of the distribution about its mean compared to the normal (Gaussian) distribution.
- The normal distribution has a skewness of 0.
- Positive skewness indicates the distribution having an asymmetric tail extending toward more positive values.
- Negative skewness indicates the distribution having an asymmetric tail extending toward more negative values.
ts.skew
uses the argument value.expression
to specify the column for which skew is to be computed. Use ts.skew(distinct(column_name))
to exclude duplicate rows while calculating skew values.
# Calculate the skewness of the 'temperature' column of sequenced PTI table.
df_seq_skew <- df_seq_grp %>% summarise(skew_temp = ts.skew(temperature))
# Print the results.
df_seq_skew %>% arrange(TIMECODE_RANGE, buoyid, skew_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, skew_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid skew_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-~ 35345 0 0.000324
#> 2 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 0
#> 3 2014-01-06 10:00:00.000000+00:00,2014-~ 35349 44 -0.127
#> 4 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 44 NA
#> 6 2014-01-06 21:00:00.000000+00:00,2014-~ 35371 2 0
# Exclude duplicates in the same aggregate operation above.
df_seq_skew <- df_seq_grp %>% summarise(skew_temp = ts.skew(distinct(temperature)))
# Print the results.
df_seq_skew %>% arrange(TIMECODE_RANGE, buoyid, skew_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, skew_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid skew_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-~ 35345 0 -1.73
#> 2 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 0
#> 3 2014-01-06 10:00:00.000000+00:00,2014-~ 35349 44 -1.99
#> 4 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 44 NA
#> 6 2014-01-06 21:00:00.000000+00:00,2014-~ 35371 2 0
# Calculate the skewness of the 'temperature' column of non-sequenced PTI table.
df_nonseq_skew <- df_nonseq_grp %>% summarise(skew_temp = ts.skew(temperature))
# Print the results.
df_nonseq_skew %>% arrange(TIMECODE_RANGE, skew_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, skew_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ skew_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01~ 3 -0.270
# Calculate the skewness of the 'temperature' column of non-PTI table.
df_nonpti_skew <- df_nonpti_grp %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(skew_temp = ts.skew(temperature))
# Print the results.
df_nonpti_skew %>% arrange(TIMECODE_RANGE, skew_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, skew_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ skew_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06~ 2314993 NA
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06~ 2314994 NA
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06~ 2314999 0
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06~ 2315005 -0.384
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06~ 2315006 NA
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06~ 2315008 NA
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.skew
for time series aggregate operations, instead of regular aggregate functionskew
.
- Teradata recommends to strictly use
- Following conditions will produce NULL result:
- Fewer than three non-NULLl data points in the data used for the computation.
- Standard deviation for a column is equal to 0.
ts.skew
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.skew() with mutate and filter.
df_skew <- df_seq %>% mutate(skew_t = ts.skew(temperature))
df_skew <- df_seq %>% filter(temperature <= ts.skew(temperature))
Sum : ts.sum()
The aggregate function ts.sum
returns the sum of values in the column grouped by time. ts.sum
uses the argument value.expression
to specify the column for which sum is to be computed. Use ts.sum(distinct(column_name))
to exclude duplicate rows while calculating sum.
# Calculate the sum of the values in the 'temperature' column of sequenced PTI table.
df_seq_sum <- df_seq_grp %>% summarise(sum_temp = ts.sum(temperature))
# Print the results.
df_seq_sum %>% arrange(TIMECODE_RANGE, buoyid, sum_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, sum_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid sum_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 219
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 447
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 539
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 86
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 243
# Exclude duplicates in the same aggregate operation above.
df_seq_sum <- df_seq_grp %>% summarise(sum_temp = ts.sum(distinct(temperature)))
# Print the results.
df_seq_sum %>% arrange(TIMECODE_RANGE, buoyid, sum_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, sum_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid sum_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 209
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 447
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 261
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 243
# Calculate the sum of the values in the 'temperature' column of non-sequenced PTI table.
df_nonseq_sum <- df_nonseq_grp %>% summarise(sum_temp = ts.sum(temperature))
# Print the results.
df_nonseq_sum %>% arrange(TIMECODE_RANGE, sum_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, sum_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ sum_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 1534
# Calculate the sum of the values in the 'temperature' column of non-PTI table.
df_nonpti_sum <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(sum_temp = ts.sum(temperature))
# Print the results.
df_nonpti_sum %>% arrange(TIMECODE_RANGE, sum_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, sum_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ sum_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 2314993 109
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06 ~ 2314994 110
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06 ~ 2314999 447
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06 ~ 2315005 496
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06 ~ 2315006 43
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06 ~ 2315008 43
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.sum
for time series aggregate operations, instead of regular aggregate functionsum
. ts.sum
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.sum() with mutate and filter.
df_sum <- df_seq %>% mutate(sum_t = ts.sum(temperature))
df_sum <- df_seq %>% filter(temperature <= ts.sum(temperature))
Sample Standard Deviation : ts.sd()
The aggregate function ts.sd
returns the sample standard deviation of values of the column grouped by time. The standard deviation is the second moment of a distribution. ts.sd
uses the argument value.expression
to specify the column for which sample standard deviation is to be computed. Use ts.sd(distinct(column_name))
to exclude duplicate rows while calculating sample standard deviation.
# Calculate the sample standard deviation in the 'temperature' column of sequenced PTI table.
df_seq_sd <- df_seq_grp %>% summarise(sd_temp = ts.sd(temperature))
# Print the results.
df_seq_sd %>% arrange(TIMECODE_RANGE, buoyid, sd_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, sd_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ buoyid sd_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 51.7
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 3.94
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 5.81
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 0
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 1
# Exclude duplicates in the same aggregate operation above.
df_seq_sd <- df_seq_grp %>% summarise(sd_temp = ts.sd(distinct(temperature)))
# Print the results.
df_seq_sd %>% arrange(TIMECODE_RANGE, buoyid, sd_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, sd_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ buoyid sd_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 51.7
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 3.94
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 5.26
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 NA
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 1
# Calculate the sample standard deviation in the 'temperature' column of non-sequenced PTI table.
df_nonseq_sd <- df_nonseq_grp %>% summarise(sd_temp = ts.sd(temperature))
# Print the results.
df_nonseq_sd %>% arrange(TIMECODE_RANGE, sd_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, sd_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ sd_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 0~ 3 22.8
# Calculate the sample standard deviation in the 'temperature' column of non-PTI table.
df_nonpti_sd <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(sd_temp = ts.sd(temperature))
# Print the results.
df_nonpti_sd %>% arrange(TIMECODE_RANGE, sd_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, sd_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ sd_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 0~ 2314993 62.9
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06 0~ 2314994 63.6
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06 0~ 2314999 3.94
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06 1~ 2315005 5.76
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06 1~ 2315006 NA
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06 1~ 2315008 NA
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.sd
for time series aggregate operations, instead of regular aggregate functionsd
. - When there are fewer than two non-NULL data points in the sample used for the computation,
ts.sd
returns NULL/NA. - Division by zero results in NULL/NA value rather than an error.
- If data represents only a sample of the entire population for the column, Teradata recommends to use
ts.sd
to calculate sample standard deviation instead ofts.sdp
which calculates population standard deviation. As the sample size increases, the values forts.sd
andts.sdp
approach the same number. ts.sd
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.sd() with mutate and filter.
df_sd <- df_seq %>% mutate(sd_t = ts.sd(temperature))
df_sd <- df_seq %>% filter(temperature <= ts.sd(temperature))
Population Standard Deviation : ts.sdp()
The aggregate function ts.sdp
returns the population standard deviation of values of the column grouped by time. The standard deviation is the second moment of a distribution. ts.sdp
uses the argument value.expression
to specify the column for which population standard deviation is to be computed. Use ts.sdp(distinct(column_name))
to exclude duplicate rows while calculating population standard deviation.
# Calculate the population standard deviation in the 'temperature' column of sequenced PTI table.
df_seq_sdp <- df_seq_grp %>% summarise(sdp_temp = ts.sdp(temperature))
# Print the results.
df_seq_sdp %>% arrange(TIMECODE_RANGE, buoyid, sdp_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, sdp_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid sdp_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 44.8
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 3.59
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 5.54
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 0
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 0
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 0.816
# Exclude duplicates in the same aggregate operation above.
df_seq_sdp <- df_seq_grp %>% summarise(sdp_temp = ts.sdp(distinct(temperature)))
# Print the results.
df_seq_sdp %>% arrange(TIMECODE_RANGE, buoyid, sdp_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, sdp_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid sdp_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 42.2
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 3.59
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 4.71
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 0
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 0
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 0.816
# Calculate the population standard deviation in the 'temperature' column of non-sequenced PTI table.
df_nonseq_sdp <- df_nonseq_grp %>% summarise(sdp_temp = ts.sdp(temperature))
# Print the results.
df_nonseq_sdp %>% arrange(TIMECODE_RANGE, sdp_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, sdp_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ sdp_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 22.4
# Calculate the population standard deviation in the 'temperature' column of non-PTI table.
df_nonpti_sdp <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(sdp_temp = ts.sdp(temperature))
# Print the results.
df_nonpti_sdp %>% arrange(TIMECODE_RANGE, sdp_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, sdp_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ sdp_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 2314993 44.5
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06 ~ 2314994 45
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06 ~ 2314999 3.59
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06 ~ 2315005 5.46
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06 ~ 2315006 0
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06 ~ 2315008 0
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.sdp
for time series aggregate operations, instead of regular aggregate functionsdp
. - When there are fewer than two non-NULL data points in the population used for the computation,
ts.sdp
returns NULL/NA. - Division by zero results in NULL/NA value rather than an error.
ts.sdp
can only be used if data represents entire population. Otherwise, Teradata recommends to usets.sd
to calculate sample standard deviation.ts.sdp
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.sdp() with mutate and filter.
df_sdp <- df_seq %>% mutate(sdp_t = ts.sdp(temperature))
df_sdp <- df_seq %>% filter(temperature <= ts.sdp(temperature))
Minimum : ts.min()
The aggregate function ts.min
returns the minimum value in the column grouped by time. ts.min
uses the argument value.expression
to specify the column for which minimum value is to be computed. Use ts.min(distinct(column_name))
to exclude duplicate rows while calculating minimum value.
# Calculate the minimum value in the 'temperature' column of sequenced PTI table.
df_seq_min <- df_seq_grp %>% summarise(min_temp = ts.min(temperature))
# Print the results.
df_seq_min %>% arrange(TIMECODE_RANGE, buoyid, min_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, min_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid min_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 10
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 70
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 43
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 80
# Exclude duplicates in the same aggregate operation above.
df_seq_min <- df_seq_grp %>% summarise(min_temp = ts.min(distinct(temperature)))
# Print the results.
df_seq_min %>% arrange(TIMECODE_RANGE, buoyid, min_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, min_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid min_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 10
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 70
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 43
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 80
# Calculate the minimum value in the 'temperature' column of non-sequenced PTI table.
df_nonseq_min <- df_nonseq_grp %>% summarise(min_temp = ts.min(temperature))
# Print the results.
df_nonseq_min %>% arrange(TIMECODE_RANGE, min_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, min_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ min_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 10
# Calculate the minimum value in the 'temperature' column of non-PTI table.
df_nonpti_min <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(min_temp = ts.min(temperature))
# Print the results.
df_nonpti_min %>% arrange(TIMECODE_RANGE, min_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, min_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ min_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 2314993 10
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06 ~ 2314994 10
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06 ~ 2314999 70
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06 ~ 2315005 43
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06 ~ 2315006 43
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06 ~ 2315008 43
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.min
for time series aggregate operations, instead of regular aggregate functionmin
. ts.min
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.min() with mutate and filter.
df_min <- df_seq %>% mutate(min_t = ts.min(temperature))
df_min <- df_seq %>% filter(temperature <= ts.min(temperature))
Maximum : ts.max()
The aggregate function ts.max
returns the maximum value in the column grouped by time. ts.max
uses the argument value.expression
to specify the column for which maximum value is to be computed. Use ts.max(distinct(column_name))
to exclude duplicate rows while calculating maximum value.
# Calculate the maximum value in the 'temperature' column of sequenced PTI table.
df_seq_max <- df_seq_grp %>% summarise(max_temp = ts.max(temperature))
# Print the results.
df_seq_max %>% arrange(TIMECODE_RANGE, buoyid, max_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, max_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid max_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 100
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 79
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 56
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 82
# Exclude duplicates in the same aggregate operation above.
df_seq_max <- df_seq_grp %>% summarise(max_temp = ts.max(distinct(temperature)))
# Print the results.
df_seq_max %>% arrange(TIMECODE_RANGE, buoyid, max_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, max_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid max_temp
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 100
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 79
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 56
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 82
# Calculate the maximum value in the 'temperature' column of non-sequenced PTI table.
df_nonseq_max <- df_nonseq_grp %>% summarise(max_temp = ts.max(temperature))
# Print the results.
df_nonseq_max %>% arrange(TIMECODE_RANGE, max_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, max_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ max_temp
#> <chr> <int64> <int>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 100
# Calculate the maximum value in the 'temperature' column of non-PTI table.
df_nonpti_max <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(max_temp = ts.max(temperature))
# Print the results.
df_nonpti_max %>% arrange(TIMECODE_RANGE, max_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, max_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ max_temp
#> <chr> <int64> <int>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 2314993 99
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06 ~ 2314994 100
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06 ~ 2314999 79
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06 ~ 2315005 56
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06 ~ 2315006 43
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06 ~ 2315008 43
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.max
for time series aggregate operations, instead of regular aggregate functionmax
. ts.max
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.max() with mutate and filter.
df_max <- df_seq %>% mutate(max_t = ts.max(temperature))
df_max <- df_seq %>% filter(temperature <= ts.max(temperature))
Sample Variance : ts.var()
The aggregate function ts.var
returns the sample variance of values of the column grouped by time. The variance of a sample is a measure of dispersion from the mean of that sample. It is the square of the sample standard deviation. ts.var
uses the argument value.expression
to specify the column for which sample variance is to be computed. Use ts.var(distinct(column_name))
to exclude duplicate rows while calculating sample variance.
# Calculate the sample variance of values in the 'temperature' column of sequenced PTI table.
df_seq_var <- df_seq_grp %>% summarise(var_temp = ts.var(temperature))
# Print the results.
df_seq_var %>% arrange(TIMECODE_RANGE, buoyid, var_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, var_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid var_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 2670.
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 15.5
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 33.8
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 0
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 1
# Exclude duplicates in the same aggregate operation above.
df_seq_var <- df_seq_grp %>% summarise(var_temp = ts.var(distinct(temperature)))
# Print the results.
df_seq_var %>% arrange(TIMECODE_RANGE, buoyid, var_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, var_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid var_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 2670.
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 15.5
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 27.7
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 NA
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 NA
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 1
# Calculate the sample variance of values in the 'temperature' column of non-sequenced PTI table.
df_nonseq_var <- df_nonseq_grp %>% summarise(var_temp = ts.var(temperature))
# Print the results.
df_nonseq_var %>% arrange(TIMECODE_RANGE, var_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, var_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ var_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 520.
# Calculate the sample variance of values in the 'temperature' column of non-PTI table.
df_nonpti_var <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(var_temp = ts.var(temperature))
# Print the results.
df_nonpti_var %>% arrange(TIMECODE_RANGE, var_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, var_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ var_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 2314993 3960.
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06 ~ 2314994 4050
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06 ~ 2314999 15.5
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06 ~ 2315005 33.2
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06 ~ 2315006 NA
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06 ~ 2315008 NA
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.var
for time series aggregate operations, instead of regular aggregate functionvar
. - When there are fewer than two non-NULL data points in the sample used for the computation,
ts.var
returns NULL/NA. - Division by zero results in NULL/NA value rather than an error.
- If data represents only a sample of the entire population for the column, Teradata recommends to use
ts.var
to calculate sample variance instead ofts.varp
which calculates population variance. As the sample size increases, the values forts.var
andts.varp
approach the same number. ts.var
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.var() with mutate and filter.
df_var <- df_seq %>% mutate(var_t = ts.var(temperature))
df_var <- df_seq %>% filter(temperature <= ts.var(temperature))
Population Variance : ts.varp()
The aggregate function ts.varp
returns the population variance of values of the column grouped by time. The variance of a population is a measure of dispersion from the mean of that population. ts.varp
uses the argument value.expression
to specify the column for which population variance is to be computed. Use ts.varp(distinct(column_name))
to exclude duplicate rows while calculating population variance.
# Calculate the population variance of values in the 'temperature' column of sequenced PTI table.
df_seq_varp <- df_seq_grp %>% summarise(varp_temp = ts.varp(temperature))
# Print the results.
df_seq_varp %>% arrange(TIMECODE_RANGE, buoyid, varp_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, varp_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid varp_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-~ 35345 0 2003.
#> 2 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 12.9
#> 3 2014-01-06 10:00:00.000000+00:00,2014-~ 35349 44 30.7
#> 4 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 22 0
#> 5 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 44 0
#> 6 2014-01-06 21:00:00.000000+00:00,2014-~ 35371 2 0.667
# Exclude duplicates in the same aggregate operation above.
df_seq_varp <- df_seq_grp %>% summarise(varp_temp = ts.varp(distinct(temperature)))
# Print the results.
df_seq_varp %>% arrange(TIMECODE_RANGE, buoyid, varp_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, varp_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid varp_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-~ 35345 0 1780.
#> 2 2014-01-06 09:00:00.000000+00:00,2014-~ 35347 1 12.9
#> 3 2014-01-06 10:00:00.000000+00:00,2014-~ 35349 44 22.2
#> 4 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 22 0
#> 5 2014-01-06 10:30:00.000000+00:00,2014-~ 35350 44 0
#> 6 2014-01-06 21:00:00.000000+00:00,2014-~ 35371 2 0.667
# Calculate the population variance of values in the 'temperature' column of non-sequenced PTI table.
df_nonseq_varp <- df_nonseq_grp %>% summarise(varp_temp = ts.varp(temperature))
# Print the results.
df_nonseq_varp %>% arrange(TIMECODE_RANGE, varp_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, varp_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ varp_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01~ 3 500.
# Calculate the population variance of values in the 'temperature' column of non-PTI table.
df_nonpti_varp <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(varp_temp = ts.varp(temperature))
# Print the results.
df_nonpti_varp %>% arrange(TIMECODE_RANGE, varp_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, varp_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ varp_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06~ 2314993 1980.
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06~ 2314994 2025
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06~ 2314999 12.9
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06~ 2315005 29.8
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06~ 2315006 0
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06~ 2315008 0
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.varp
for time series aggregate operations, instead of regular aggregate functionvarp
. - When there are fewer than two non-NULL data points in the population used for the computation,
ts.varp
returns NULL/NA. - Division by zero results in NULL value rather than an error.
ts.varp
can only be used if data represents entire population. Otherwise, Teradata recommends to usets.var
to calculate sample variance.ts.varp
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.varp() with mutate and filter.
df_varp <- df_seq %>% mutate(varp_t = ts.varp(temperature))
df_varp <- df_seq %>% filter(temperature <= ts.varp(temperature))
Average : ts.mean()
The aggregate function ts.mean
returns the average value in the column grouped by time. ts.mean
uses the argument value.expression
to specify the column for which average value is to be computed. Use ts.mean(distinct(column_name))
to exclude duplicate rows while calculating average.
# Calculate the average value in the 'temperature' column of sequenced PTI table.
df_seq_avg <- df_seq_grp %>% summarise(avg_temp = ts.mean(temperature))
# Print the results.
df_seq_avg %>% arrange(TIMECODE_RANGE, buoyid, avg_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, avg_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid avg_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 54.8
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 74.5
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 49
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 81
# Exclude duplicates in the same aggregate operation above.
df_seq_avg <- df_seq_grp %>% summarise(avg_temp = ts.mean(distinct(temperature)))
# Print the results.
df_seq_avg %>% arrange(TIMECODE_RANGE, buoyid, avg_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, avg_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid avg_temp
#> <chr> <int64> <int> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-0~ 35345 0 69.7
#> 2 2014-01-06 09:00:00.000000+00:00,2014-0~ 35347 1 74.5
#> 3 2014-01-06 10:00:00.000000+00:00,2014-0~ 35349 44 52.2
#> 4 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 22 23
#> 5 2014-01-06 10:30:00.000000+00:00,2014-0~ 35350 44 43
#> 6 2014-01-06 21:00:00.000000+00:00,2014-0~ 35371 2 81
# Calculate the average value in the 'temperature' column of non-sequenced PTI table.
df_nonseq_avg <- df_nonseq_grp %>% summarise(avg_temp = ts.mean(temperature))
# Print the results.
df_nonseq_avg %>% arrange(TIMECODE_RANGE, avg_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, avg_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ avg_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-01 ~ 3 59
# Calculate the average value in the 'temperature' column of non-PTI table.
df_nonpti_avg <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(avg_temp = ts.mean(temperature))
# Print the results.
df_nonpti_avg %>% arrange(TIMECODE_RANGE, avg_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, avg_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTES~ avg_temp
#> <chr> <int64> <dbl>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06 ~ 2314993 54.5
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06 ~ 2314994 55
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06 ~ 2314999 74.5
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06 ~ 2315005 49.6
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06 ~ 2315006 43
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06 ~ 2315008 43
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
- Teradata recommends to strictly use
ts.mean
for time series aggregate operations, instead of regular aggregate functionmean
. ts.mean
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.mean() with mutate and filter.
df_avg <- df_seq %>% mutate(avg_t = ts.mean(temperature))
df_avg <- df_seq %>% filter(temperature <= ts.mean(temperature))
Count : ts.n()
The aggregate function ts.n
returns the total number of qualified rows in the column grouped by time. ts.n
uses the argument value.expression
to specify the column for which count is to be computed. The function returns the number of rows per group when the argument takes *
, which is the default value of the function. Use ts.n(distinct(column_name))
to exclude duplicate rows while calculating the number of rows.
# Calculate the number of rows in the 'temperature' column of sequenced PTI table.
df_seq_count <- df_seq_grp %>% summarise(count_temp = ts.n(temperature))
# Print the results.
df_seq_count %>% arrange(TIMECODE_RANGE, buoyid, count_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, count_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid count_temp
#> <chr> <int64> <int> <int64>
#> 1 2014-01-06 08:00:00.000000+00:00,2014~ 35345 0 4
#> 2 2014-01-06 09:00:00.000000+00:00,2014~ 35347 1 6
#> 3 2014-01-06 10:00:00.000000+00:00,2014~ 35349 44 11
#> 4 2014-01-06 10:30:00.000000+00:00,2014~ 35350 22 1
#> 5 2014-01-06 10:30:00.000000+00:00,2014~ 35350 44 2
#> 6 2014-01-06 21:00:00.000000+00:00,2014~ 35371 2 3
# Exclude duplicates in the same aggregate operation above.
df_seq_count <- df_seq_grp %>% summarise(count_temp = ts.n(distinct(temperature)))
# Print the results.
df_seq_count %>% arrange(TIMECODE_RANGE, buoyid, count_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, count_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid count_temp
#> <chr> <int64> <int> <int64>
#> 1 2014-01-06 08:00:00.000000+00:00,2014~ 35345 0 3
#> 2 2014-01-06 09:00:00.000000+00:00,2014~ 35347 1 6
#> 3 2014-01-06 10:00:00.000000+00:00,2014~ 35349 44 5
#> 4 2014-01-06 10:30:00.000000+00:00,2014~ 35350 22 1
#> 5 2014-01-06 10:30:00.000000+00:00,2014~ 35350 44 1
#> 6 2014-01-06 21:00:00.000000+00:00,2014~ 35371 2 3
# Calculate the number of rows in the sequenced PTI table.
# Note that the argument is not provided for the function `ts.n()`.
df_seq_count <- df_seq_grp %>% summarise(count_temp = ts.n())
# Print the results.
df_seq_count %>% arrange(TIMECODE_RANGE, buoyid, count_temp)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, count_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUT~ buoyid count_temp
#> <chr> <int64> <int> <int64>
#> 1 2014-01-06 08:00:00.000000+00:00,2014~ 35345 0 5
#> 2 2014-01-06 09:00:00.000000+00:00,2014~ 35347 1 6
#> 3 2014-01-06 10:00:00.000000+00:00,2014~ 35349 44 11
#> 4 2014-01-06 10:30:00.000000+00:00,2014~ 35350 22 1
#> 5 2014-01-06 10:30:00.000000+00:00,2014~ 35350 44 2
#> 6 2014-01-06 21:00:00.000000+00:00,2014~ 35371 2 3
# Calculate the number of rows in the 'temperature' column of non-sequenced PTI table.
df_nonseq_count <- df_nonseq_grp %>% summarise(count_temp = ts.n(temperature))
# Print the results.
df_nonseq_count %>% arrange(TIMECODE_RANGE, count_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, count_temp
#> TIMECODE_RANGE `GROUP BY TIME(CAL_YEA~ count_temp
#> <chr> <int64> <int64>
#> 1 2014-01-01 00:00:00.000000+00:00,2015-01-0~ 3 26
# Calculate the number of rows in the 'temperature' column of non-PTI table.
df_nonpti_count <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(count_temp = ts.n(temperature))
# Print the results.
df_nonpti_count %>% arrange(TIMECODE_RANGE, count_temp)
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, count_temp
#> TIMECODE_RANGE `GROUP BY TIME(MINUTE~ count_temp
#> <chr> <int64> <int64>
#> 1 2014-01-06 08:00:00.000000+00:00,2014-01-06~ 2314993 2
#> 2 2014-01-06 08:10:00.000000+00:00,2014-01-06~ 2314994 2
#> 3 2014-01-06 09:00:00.000000+00:00,2014-01-06~ 2314999 6
#> 4 2014-01-06 10:00:00.000000+00:00,2014-01-06~ 2315005 10
#> 5 2014-01-06 10:10:00.000000+00:00,2014-01-06~ 2315006 1
#> 6 2014-01-06 10:30:00.000000+00:00,2014-01-06~ 2315008 1
#> # ... with more rows
Notes
- NULLs/NAs are not included in the result computation.
ts.n
is valid on all columns with any data type.- Teradata recommends to strictly use
ts.n
for time series aggregate operations, instead of regular aggregate functionn
. ts.n
cannot be used with dplyr verbs likemutate
,filter
as shown in the example below.
# Incorrect usage of ts.n() with mutate and filter.
df_count <- df_seq %>% mutate(count_t = ts.n(temperature))
df_count <- df_seq %>% filter(temperature <= ts.n(temperature))
Describe : ts.describe()
The aggregate function ts.describe
returns the column statistics of the values in each group. By default, the function calulates maximum, minimum, average, sample standard deviation, median, mode, 25th percentile, 50th percentile and 75th percentile values of the column. ts.describe
uses the argument value.expression
to specify the column for which the statistics are to be calculated. Use ts.describe(distinct(column_name))
to exclude duplicate rows while calculating statistics.
Unlike other aggregate functions, ts.describe
generates multiple columns. The column name provided in argument summarise
will not be used as the resultant column name. The resultant column names are as follows:
Without distinct() | With distinct() |
---|---|
Maximum(column_name) | Maximum(Distinct(column_name)) |
Miniumum(column_name) | Minimum(Distinct(column_name)) |
Average(column_name) | Average(Distinct(column_name) |
STDDEV_SAMP(column_name) | STDDEV_SAMP(Distinct(column_name)) |
MEDIAN(column_name) | MEDIAN(Distinct(column_name)) |
MODE(column_name) | MODE(column_name) |
PERCENTILE(column_name, 25, LINEAR)) | PERCENTILE(Distinct(column_name, 25, LINEAR)) |
PERCENTILE(column_name, 50, LINEAR)) | PERCENTILE(Distinct(column_name, 50, LINEAR)) |
PERCENTILE(column_name, 75, LINEAR)) | PERCENTILE(Distinct(column_name, 75, LINEAR)) |
Please refer to notes section for additional information.
# Calculate the statistics of the 'temperature' column of sequenced PTI table.
df_seq_describe <- df_seq_grp %>% summarise(describe_temp = ts.describe(temperature))
# Print the results.
df_seq_describe %>% arrange(TIMECODE_RANGE, buoyid, `MODE(temperature)`)
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, buoyid, `MODE(temperature)`
#> TIMECODE_RANGE `GROUP BY TIME(~ buoyid `Maximum(temper~ `Minimum(temper~
#> <chr> <int64> <int> <int> <int>
#> 1 2014-01-06 08~ 35345 0 100 10
#> 2 2014-01-06 09~ 35347 1 79 70
#> 3 2014-01-06 09~ 35347 1 79 70
#> 4 2014-01-06 09~ 35347 1 79 70
#> 5 2014-01-06 09~ 35347 1 79 70
#> 6 2014-01-06 09~ 35347 1 79 70
#> # ... with more rows, and 7 more variables: `Average(temperature)` <dbl>,
#> # `STDDEV_SAMP(temperature)` <dbl>, `MEDIAN(temperature)` <dbl>,
#> # `MODE(temperature)` <int>, `PERCENTILE(temperature, 25, LINEAR)` <dbl>,
#> # `PERCENTILE(temperature, 50, LINEAR)` <dbl>, `PERCENTILE(temperature, 75,
#> # LINEAR)` <dbl>
# Calculate the statistics of the 'temperature' column of non-sequenced PTI table.
df_nonseq_describe <- df_nonseq_grp %>% summarise(ts.describe(temperature))
# Print the results.
df_nonseq_describe
#> # Source: lazy query [?? x 11]
#> # Database: Teradata
#> TIMECODE_RANGE `GROUP BY TIME(~ `Maximum(temper~ `Minimum(temper~
#> <chr> <int64> <int> <int>
#> 1 2014-01-01 00~ 3 100 10
#> # ... with 7 more variables: `Average(temperature)` <dbl>,
#> # `STDDEV_SAMP(temperature)` <dbl>, `MEDIAN(temperature)` <dbl>,
#> # `MODE(temperature)` <int>, `PERCENTILE(temperature, 25, LINEAR)` <dbl>,
#> # `PERCENTILE(temperature, 50, LINEAR)` <dbl>, `PERCENTILE(temperature, 75,
#> # LINEAR)` <dbl>
# Calculate the statistics of only distinct values of the 'temperature' column of non-PTI table.
df_nonpti_describe <- df_nonpti %>% group_by_time(timebucket.duration = "10m", timecode.column = "TIMECODE") %>% summarise(describe_temp = ts.describe(distinct(temperature)))
# Print the results.
df_nonpti_describe %>% arrange(TIMECODE_RANGE, `Mode(temperature)`)
#> # Source: lazy query [?? x 11]
#> # Database: Teradata
#> # Ordered by: TIMECODE_RANGE, `Mode(temperature)`
#> TIMECODE_RANGE `GROUP BY TIME(~ `Maximum(Distin~ `Minimum(Distin~
#> <chr> <int64> <int> <int>
#> 1 2014-01-06 08~ 2314993 99 10
#> 2 2014-01-06 08~ 2314993 99 10
#> 3 2014-01-06 08~ 2314994 100 10
#> 4 2014-01-06 08~ 2314994 100 10
#> 5 2014-01-06 09~ 2314999 79 70
#> 6 2014-01-06 09~ 2314999 79 70
#> # ... with more rows, and 7 more variables:
#> # `Average(Distinct(temperature))` <dbl>,
#> # `STDDEV_SAMP(Distinct(temperature))` <dbl>,
#> # `MEDIAN(Distinct(temperature))` <dbl>, `MODE(temperature)` <int>,
#> # `PERCENTILE(Distinct(temperature, 25, LINEAR))` <dbl>,
#> # `PERCENTILE(Distinct(temperature, 50, LINEAR))` <dbl>,
#> # `PERCENTILE(Distinct(temperature, 75, LINEAR))` <dbl>
The following example shows how to use the resultant columns of the describe operation in other dplyr verbs:
# Select some columns from the describe operation and perform filter on one of the selected columns.
df_sel <- df_nonpti_describe %>% select(`GROUP BY TIME(MINUTES(10))`,
`PERCENTILE(Distinct(temperature, 75, LINEAR))`,
`Average(Distinct(temperature))`,
`MODE(temperature)`
)
df_filter <- df_sel %>% filter(`PERCENTILE(Distinct(temperature, 75, LINEAR))` < 77)
# Print the results.
df_filter %>% arrange(`GROUP BY TIME(MINUTES(10))`, `MODE(temperature)`)
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> # Ordered by: `GROUP BY TIME(MINUTES(10))`, `MODE(temperature)`
#> `GROUP BY TIME(MIN~ `PERCENTILE(Distinct(~ `Average(Distinct~ `MODE(temperatu~
#> <int64> <dbl> <dbl> <int>
#> 1 2314993 76.8 54.5 10
#> 2 2314993 76.8 54.5 99
#> 3 2315005 55 52.2 43
#> 4 2315006 43 43 43
#> 5 2315008 43 43 43
#> 6 2315010 43 43 43
Notes
- NULLs/NAs are not included in the result computation.
ts.describe
is valid only for numeric data.ts.describe
must be used withgroup_by_time()
.- Unlike other statistical values, mode values are calculated without excluding duplicates even though
distinct()
is used. - Mode computation can result in duplicate rows with different values.
- Percentile computations use a linear interpolation process to determine if the percentile lies between two data points. Please refer to
ts.percentile
for additional interpolation options.