Teradata R Package Function Reference | 17.00 - 17.00 - group_by_time - Teradata R Package

Teradata® R Package Function Reference

prodname
Teradata R Package
vrm_release
17.00
created_date
September 2020
category
Programming Reference
featnum
B700-4007-090K

Description

This function takes an existing tbl_teradata and converts into a grouped tbl_teradata object that allows a set of aggregate functions to be computed on data grouped in terms of time. Although the grouping is optimized for objects of class "tbl_teradata" created using PTI tables, it is also supported on objects of class "tbl_teradata" created using non-PTI tables when the argument "timecode.column" is specified.
Note: If consecutive groupings are performed, latest grouping operation is considered, i.e. if groupby is followed by groupby_time, groupby_time grouping is considered for aggregation. Similarly, if groupby_time is followed by groupby, groupby grouping is considered for aggregation. If consecutive groupby_time operations are performed, latest groupby_time is considered for aggregation.

Usage

group_by_time(
  df,
  timebucket.duration,
  value.expression = c(),
  timecode.column = NULL,
  sequence.column = NULL,
  fill = NULL,
  ...
)

Arguments

df

Required Argument.
Specifies the tbl_teradata on which grouping based on time is to be done.

timebucket.duration

Required Argument.
Specifies the duration of each timebucket for aggregation and is used to assign each potential timebucket a unique number. The timebucket number is shown as a column in the aggregated tbl_teradata.
A time duration can be specified using any of the units of time as shown below:

------------------------------------------------------- ---------------------------------------------------------------------
Time Unit Formal Form Example Shorthand Equivalents
------------------------------------------------------- ---------------------------------------------------------------------
Calendar Years CAL_YEARS(N) Ncy, Ncyear, Ncyears
Calendar Months CAL_MONTHS(N) Ncm, Ncmonth, Ncmonths
Calendar Days CAL_DAYS(N) Ncd, Ncday, Ncdays
Weeks WEEKS(N) Nw, Nweek, Nweeks
Days DAYS(N) Nd, Nday, Ndays
Hours HOURS(N) Nh, Nhr, Nhrs, Nhour, Nhours
Minutes MINUTES(N) Nm, Nmins, Nminute, Nminutes
Seconds SECONDS(N) Ns, Nsec, Nsecs, Nsecond, Nseconds
Milliseconds MILLISECONDS(N) Nms, Nmsec, Nmsecs, Nmillisecond, Nmilliseconds
Microseconds MICROSECONDS(N) Nus, Nusec, Nusecs, Nmicrosecond, Nmicroseconds
------------------------------------------------------- ---------------------------------------------------------------------

Where, N is a 16-bit positive integer with a maximum value of 32767.
When timebucket.duration is Calendar Days, columns are grouped in 24-hour periods starting at 00:00:00.000000 and ending at 23:59:59.999999 on the day identified by time zero.
The time units do not store values such as the year or the month. For example, CAL_YEARS(2017) does not set the year to 2017. It sets the timebucket.duration to intervals of 2017 years. Similarly, CAL_MONTHS(7) does not set the month to July. It sets the timebucket.duration to intervals of 7 months.
A DAYS time unit is a 24-hour span relative to any moment in time. For example, if time zero (while creating PTI tables in Teradata SQL Engine) is 2016-10-01 12:00:00, the day buckets are 2016-10-01 12:00:00.000000 - 2016-10-02 11:59:59.999999. This spans multiple calendar days, but encompasses one 24-hour period representative of a day.
Note: This argument can take unbounded timebucket duration ("*") that can only be used with ts.delta_t time series aggregate function.
Types: str
Examples:

  1. MINUTES(23) which is equal to 23 minutes

  2. CAL_MONTHS(5) which is equal to 5 calendar months

value.expression

Optional Argument.
Specifies a column or any expression involving columns. These expressions are used for grouping purposes not related to time.
Note:

  1. The value.expression must not be a column reference to a view column that is derived from a function and cannot contain any ordered analytical or aggregate functions.

  2. The value.expression cannot be a literal.

Default Value : c()
Types: character OR vector of characters

timecode.column

Optional Argument.
Specifies a column that serves as the timecode for a tbl_teradata created of a non-PTI table.
For tbl_teradata created using PTI table, this argument takes TD_TIMECODE implicitly, but it can also be specified explicitly by the user with this argument.
For tbl_teradata created using non-PTI table, one must pass column name to this argument, otherwise an error is raised.
Default Value : NULL
Types: character

sequence.column

Optional Argument.
Specifies column expression that is the sequence number.
For tbl_teradata created using PTI table, it can be TD_SEQNO or any other column that acts as a sequence number.
For tbl_teradata created using non-PTI table, sequence.column is a column that plays the role of TD_SEQNO, because non-PTI tables do not have TD_SEQNO.
Note: This argument can only be given when "timecode.column" is not NULL. Default Value : NULL
Types: character

fill

Optional Argument.
Specifies the value to be used for missing timebucket values.
Below is the description of accepted values:

  1. NULLS: The missing timebuckets are returned to the user with a NULL value for all aggregate results.

  2. numeric_constant: Any Teradata SQL Engine supported Numeric literal. The missing timebuckets are returned to the user with the specified constant value for all aggregate results. If the data type specified in this argument is incompatible with the input data type for an aggregate function, an error is reported.

  3. PREVIOUS/PREV: The missing timebuckets are returned to the user with the aggregate results populated by the value of the closest previous timebucket with a non-missing value. If the immediate predecessor of a missing timebucket is also missing, both buckets, and any other immediate predecessors with missing values, are loaded with the first preceding non-missing value. If a missing timebucket has no predecessor with a result (for example, if the timebucket is the first in the series or all the preceding timebuckets in the entire series are missing), the missing timebuckets are returned to the user with a NULL value for all aggregate results. The abbreviation PREV may be used instead of PREVIOUS.

  4. NEXT: The missing timebuckets are returned to the user with the aggregate results populated by the value of the closest succeeding timebucket with a non-missing value. If the immediate successor of a missing timebucket is also missing, both buckets, and any other immediate successors with missing values, are loaded with the first succeeding non-missing value. If a missing timebucket has no successor with a result (for example, if the timebucket is the last in the series or all the succeeding timebuckets in the entire series are missing), the missing timebuckets are returned to the user with a NULL value for all aggregate results.

Default Value : NULL
Permitted Values : NULLS, PREVIOUS, PREV, NEXT and any numeric constant
Types: character or a numeric value

...

Optional arguments for future purpose, if needed.

Value

A 'tbl_teradata' object.

Examples

# Get remote data source connection.
con <- td_get_context()$connection

# Load the required tables.
loadExampleData("time_series_example", "ocean_buoys_seq", "ocean_buoys_nonseq", "ocean_buoys_nonpti", "package_tracking_pti")

# Create objects of class "tbl_teradata".
df_seq <- tbl(con, "ocean_buoys_seq") # tbl of sequenced PTI table.
df_nonseq <- tbl(con, "ocean_buoys_nonseq") # tbl of non-sequenced PTI table.
df_nonpti <- tbl(con, "ocean_buoys_nonpti") # tbl of non-PTI table.

# Example 1: Grouping the sequenced PTI tables based on time with timebucket duration of 30 minutes and
#            the column 'buoyid' and ignoring the aggregation when there is no values in the group.

# Grouping by timebucket duration of 30 minutes and 'buoyid'.
# Note the use of shorthand notation for timebucket duration.
seq_group1 <- df_seq %>% group_by_time(timebucket.duration = "30m", value.expression = "buoyid")

# Applying min() aggregation on grouped tbl object.
seq_group1 %>% summarise(min_temp = min(temperature))

# Example 2: Grouping the non-PTI tables based on time with timebucket duration of 1 minute
#            and fills the missing timebuckets with previous values.

# Grouping by timebucket duration of 1 minute.
# Note the use of formal notation for timebucket duration.
# Note the use of timecode.column argument (mandatory for non-PTI table).
seq_group2 <- df_nonpti %>% group_by_time(timebucket.duration = "MINUTES(1)", timecode.column = "TIMECODE", fill = "PREV")

# Applying min() aggregation on grouped tbl object.
temp_grouped <- seq_group2 %>% summarise(min_temp = min(temperature))

# Displaying the rows in the increasing order to TIMECODE_RANGE or 'GROUP BY TIME(MINUTES(1))' column.
temp_grouped %>% arrange(`GROUP BY TIME(MINUTES(1))`)

# Example 3: Grouping the non-sequenced PTI tables based on time with timebucket duration of 
#            2 Calendar years. Note the use of timecode.column and sequence.column arguments.

# The next three grouping give a quick demo of the use of fill argument keeping all other arguments
# same.
# With fill = "PREVIOUS"
grp1 <- df_nonseq %>% group_by_time(timebucket.duration = "CAL_YEARS(1)", timecode.column = "TD_TIMECODE", sequence.column = "buoyid", fill = "PREVIOUS")
min_t <- grp1 %>% summarise(min_temp = min(temperature))
min_t %>% arrange(`GROUP BY TIME(CAL_YEARS(1))`)

# With fill = "NEXT"
grp1 <- df_nonseq %>% group_by_time(timebucket.duration = "CAL_YEARS(1)", timecode.column = "TD_TIMECODE", sequence.column = "buoyid", fill = "NEXT")
min_t <- grp1 %>% summarise(min_temp = min(temperature))
min_t %>% arrange(`GROUP BY TIME(CAL_YEARS(1))`)

# With fill = 10000 (some numeric constant)
grp1 <- df_nonseq %>% group_by_time(timebucket.duration = "CAL_YEARS(1)", timecode.column = "TD_TIMECODE", sequence.column = "buoyid", fill = 10000)
min_t <- grp1 %>% summarise(min_temp = min(temperature))
min_t %>% arrange(`GROUP BY TIME(CAL_YEARS(1))`)

# Example 4: Finding Time Elapsed between Shipping and Receiving an Item. Input data used for this 
#            example contains information about parcels sent by a delivery service. This example also
#            demonstrates the use of unbounded timebucket duration ('*') for 'ts.delta_t' time series
#            aggregate function.
df_pack_pti <- tbl(con, "package_tracking_pti")
df_grp <- df_pack_pti %>% group_by_time(timebucket.duration = "*", value.expression = "parcel_number")
df_out <- df_grp %>% summarise(delta_t = ts.delta_t("Status LIKE 'picked%up%customer'", "Status LIKE 'delivered%customer'"))
df_out %>% arrange(TIMECODE_RANGE, parcel_number)