Description
This function takes an existing tbl_teradata and converts into a grouped tbl_teradata
object that allows a set of aggregate functions to be computed on data grouped in terms of time.
Although the grouping is optimized for objects of class "tbl_teradata" created using PTI tables,
it is also supported on objects of class "tbl_teradata" created using non-PTI tables when the
argument "timecode.column" is specified.
Note: If consecutive groupings are performed, latest grouping operation is considered, i.e. if
groupby
is followed by groupby_time
, groupby_time
grouping is considered
for aggregation. Similarly, if groupby_time
is followed by groupby
, groupby
grouping is considered for aggregation. If consecutive groupby_time
operations are
performed, latest groupby_time
is considered for aggregation.
Usage
group_by_time(
df,
timebucket.duration,
value.expression = c(),
timecode.column = NULL,
sequence.column = NULL,
fill = NULL,
...
)
Arguments
df |
Required Argument.
Specifies the tbl_teradata on which grouping based on time is to be done.
|
timebucket.duration |
Required Argument.
Specifies the duration of each timebucket for aggregation and is
used to assign each potential timebucket a unique number. The
timebucket number is shown as a column in the aggregated
tbl_teradata.
A time duration can be specified using any of the units of time
as shown below:
------------------------ | ------------------------------- |
--------------------------------------------------------------------- |
Time Unit | Formal Form Example | Shorthand Equivalents |
------------------------ | ------------------------------- |
--------------------------------------------------------------------- |
Calendar Years | CAL_YEARS(N) | Ncy, Ncyear, Ncyears |
Calendar Months | CAL_MONTHS(N) | Ncm, Ncmonth, Ncmonths |
Calendar Days | CAL_DAYS(N) | Ncd, Ncday, Ncdays |
Weeks | WEEKS(N) | Nw, Nweek, Nweeks |
Days | DAYS(N) | Nd, Nday, Ndays |
Hours | HOURS(N) | Nh, Nhr, Nhrs, Nhour, Nhours |
Minutes | MINUTES(N) | Nm, Nmins, Nminute, Nminutes |
Seconds | SECONDS(N) | Ns, Nsec, Nsecs, Nsecond,
Nseconds |
Milliseconds | MILLISECONDS(N) | Nms, Nmsec, Nmsecs,
Nmillisecond, Nmilliseconds |
Microseconds | MICROSECONDS(N) | Nus, Nusec, Nusecs,
Nmicrosecond, Nmicroseconds |
------------------------ | ------------------------------- |
---------------------------------------------------------------------
|
Where, N is a 16-bit positive integer with a maximum value of
32767.
When timebucket.duration is Calendar Days, columns are grouped
in 24-hour periods starting at 00:00:00.000000 and ending at
23:59:59.999999 on the day identified by time zero.
The time units do not store values such as the year or the month.
For example, CAL_YEARS(2017) does not set the year to 2017. It sets
the timebucket.duration to intervals of 2017 years. Similarly,
CAL_MONTHS(7) does not set the month to July. It sets the
timebucket.duration to intervals of 7 months.
A DAYS time unit is a 24-hour span relative to any moment in time.
For example, if time zero (while creating PTI tables in Teradata
SQL Engine) is 2016-10-01 12:00:00, the day buckets are
2016-10-01 12:00:00.000000 - 2016-10-02 11:59:59.999999.
This spans multiple calendar days,
but encompasses one 24-hour period representative of a day.
Note: This argument can take unbounded timebucket duration ("*") that
can only be used with ts.delta_t time series aggregate
function.
Types: str
Examples:
MINUTES(23) which is equal to 23 minutes
CAL_MONTHS(5) which is equal to 5 calendar months
|
value.expression |
Optional Argument.
Specifies a column or any expression involving columns. These
expressions are used for grouping purposes not related to time.
Note:
The value.expression must not be a column reference to a view
column that is derived from a function and cannot contain any ordered
analytical or aggregate functions.
The value.expression cannot be a literal.
Default Value : c()
Types: character OR vector of characters
|
timecode.column |
Optional Argument.
Specifies a column that serves as the timecode for a tbl_teradata
created of a non-PTI table.
For tbl_teradata created using PTI table, this argument takes
TD_TIMECODE implicitly, but it can also be specified explicitly by the
user with this argument.
For tbl_teradata created using non-PTI table, one must pass column
name to this argument, otherwise an error is raised.
Default Value : NULL
Types: character
|
sequence.column |
Optional Argument.
Specifies column expression that is the sequence number.
For tbl_teradata created using PTI table, it can be TD_SEQNO or any
other column that acts as a sequence number.
For tbl_teradata created using non-PTI table, sequence.column is a
column that plays the role of TD_SEQNO, because non-PTI tables do not
have TD_SEQNO.
Note: This argument can only be given when "timecode.column" is not NULL.
Default Value : NULL
Types: character
|
fill |
Optional Argument.
Specifies the value to be used for missing timebucket values.
Below is the description of accepted values:
NULLS: The missing timebuckets are returned to the user with a NULL value
for all aggregate results.
numeric_constant: Any Teradata SQL Engine supported Numeric literal. The
missing timebuckets are returned to the user with the specified constant value
for all aggregate results. If the data type specified in this argument is
incompatible with the input data type for an aggregate function, an error is reported.
PREVIOUS/PREV: The missing timebuckets are returned to the user with the aggregate
results populated by the value of the closest previous timebucket with a non-missing value.
If the immediate predecessor of a missing timebucket is also missing, both buckets, and
any other immediate predecessors with missing values, are loaded with the first preceding
non-missing value. If a missing timebucket has no predecessor with a result (for example,
if the timebucket is the first in the series or all the preceding timebuckets in the entire
series are missing), the missing timebuckets are returned to the user with a NULL value for
all aggregate results. The abbreviation PREV may be used instead of PREVIOUS.
NEXT: The missing timebuckets are returned to the user with the aggregate results
populated by the value of the closest succeeding timebucket with a non-missing value. If
the immediate successor of a missing timebucket is also missing, both buckets, and any
other immediate successors with missing values, are loaded with the first succeeding
non-missing value. If a missing timebucket has no successor with a result (for example,
if the timebucket is the last in the series or all the succeeding timebuckets in the
entire series are missing), the missing timebuckets are returned to the user with a NULL
value for all aggregate results.
Default Value : NULL
Permitted Values : NULLS, PREVIOUS, PREV, NEXT and any numeric constant
Types: character or a numeric value
|
... |
Optional arguments for future purpose, if needed.
|
Value
A 'tbl_teradata' object.
Examples
# Get remote data source connection.
con <- td_get_context()$connection
# Load the required tables.
loadExampleData("time_series_example", "ocean_buoys_seq", "ocean_buoys_nonseq", "ocean_buoys_nonpti", "package_tracking_pti")
# Create objects of class "tbl_teradata".
df_seq <- tbl(con, "ocean_buoys_seq") # tbl of sequenced PTI table.
df_nonseq <- tbl(con, "ocean_buoys_nonseq") # tbl of non-sequenced PTI table.
df_nonpti <- tbl(con, "ocean_buoys_nonpti") # tbl of non-PTI table.
# Example 1: Grouping the sequenced PTI tables based on time with timebucket duration of 30 minutes and
# the column 'buoyid' and ignoring the aggregation when there is no values in the group.
# Grouping by timebucket duration of 30 minutes and 'buoyid'.
# Note the use of shorthand notation for timebucket duration.
seq_group1 <- df_seq %>% group_by_time(timebucket.duration = "30m", value.expression = "buoyid")
# Applying min() aggregation on grouped tbl object.
seq_group1 %>% summarise(min_temp = min(temperature))
# Example 2: Grouping the non-PTI tables based on time with timebucket duration of 1 minute
# and fills the missing timebuckets with previous values.
# Grouping by timebucket duration of 1 minute.
# Note the use of formal notation for timebucket duration.
# Note the use of timecode.column argument (mandatory for non-PTI table).
seq_group2 <- df_nonpti %>% group_by_time(timebucket.duration = "MINUTES(1)", timecode.column = "TIMECODE", fill = "PREV")
# Applying min() aggregation on grouped tbl object.
temp_grouped <- seq_group2 %>% summarise(min_temp = min(temperature))
# Displaying the rows in the increasing order to TIMECODE_RANGE or 'GROUP BY TIME(MINUTES(1))' column.
temp_grouped %>% arrange(`GROUP BY TIME(MINUTES(1))`)
# Example 3: Grouping the non-sequenced PTI tables based on time with timebucket duration of
# 2 Calendar years. Note the use of timecode.column and sequence.column arguments.
# The next three grouping give a quick demo of the use of fill argument keeping all other arguments
# same.
# With fill = "PREVIOUS"
grp1 <- df_nonseq %>% group_by_time(timebucket.duration = "CAL_YEARS(1)", timecode.column = "TD_TIMECODE", sequence.column = "buoyid", fill = "PREVIOUS")
min_t <- grp1 %>% summarise(min_temp = min(temperature))
min_t %>% arrange(`GROUP BY TIME(CAL_YEARS(1))`)
# With fill = "NEXT"
grp1 <- df_nonseq %>% group_by_time(timebucket.duration = "CAL_YEARS(1)", timecode.column = "TD_TIMECODE", sequence.column = "buoyid", fill = "NEXT")
min_t <- grp1 %>% summarise(min_temp = min(temperature))
min_t %>% arrange(`GROUP BY TIME(CAL_YEARS(1))`)
# With fill = 10000 (some numeric constant)
grp1 <- df_nonseq %>% group_by_time(timebucket.duration = "CAL_YEARS(1)", timecode.column = "TD_TIMECODE", sequence.column = "buoyid", fill = 10000)
min_t <- grp1 %>% summarise(min_temp = min(temperature))
min_t %>% arrange(`GROUP BY TIME(CAL_YEARS(1))`)
# Example 4: Finding Time Elapsed between Shipping and Receiving an Item. Input data used for this
# example contains information about parcels sent by a delivery service. This example also
# demonstrates the use of unbounded timebucket duration ('*') for 'ts.delta_t' time series
# aggregate function.
df_pack_pti <- tbl(con, "package_tracking_pti")
df_grp <- df_pack_pti %>% group_by_time(timebucket.duration = "*", value.expression = "parcel_number")
df_out <- df_grp %>% summarise(delta_t = ts.delta_t("Status LIKE 'picked%up%customer'", "Status LIKE 'delivered%customer'"))
df_out %>% arrange(TIMECODE_RANGE, parcel_number)