SeriesSplitter
Description
The SeriesSplitter function splits partitions into subpartitions
(called splits) to balance the partitions for time series
manipulation. The function creates an additional column that contains
split identifiers. Each row contains the identifier of the split
to which the row belongs. Optionally, the function also copies a
specified number of boundary rows to each split.
Usage
td_series_splitter_mle (
  data = NULL,
  partition.columns = NULL,
  duplicate.rows.count = 1,
  order.by.columns = NULL,
  split.count = 4,
  rows.per.split = 1000,
  accumulate = NULL,
  split.id.column = "split_id",
  return.stats.table = TRUE,
  values.before.first = "-1",
  values.after.last = NULL,
  duplicate.column = NULL,
  partial.split.id = FALSE,
  data.sequence.column = NULL
)
Arguments
data |
Required Argument.
Specifies the name of the input tbl_teradata to be split.
|
partition.columns |
Required Argument.
Specifies the partitioning columns of input tbl_teradata "data". These columns
determine the identity of a partition. For data type restrictions of
these columns, see the ML Engine Documentation.
Types: character OR vector of Strings (character)
|
duplicate.rows.count |
Optional Argument.
Specifies the number of rows to duplicate across split boundaries. By
default, the function duplicates one row from the previous split
and one row from the next split. If you specify only one value v1,
then the function duplicates v1 rows from the previous split and
v1 rows from the next split. If you specify two values v1 and v2,
then the function duplicates v1 rows from the previous split and
v2 rows from the next split. Each argument value must be a
non-negative integer less than or equal to 1000.
Default Value: 1
Types: numeric OR vector of numerics
|
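The one-value and two-value rule for "duplicate.rows.count" can be sketched in plain R. This is only an illustration of the documented rule, not the function's internal code; the helper name normalize_duplicate_rows_count is hypothetical.

```r
# Hypothetical helper (not part of tdplyr): expands a duplicate.rows.count
# value into the pair (rows from previous split, rows from next split).
normalize_duplicate_rows_count <- function(x = 1) {
  # One or two non-negative integers, each at most 1000, as documented.
  stopifnot(is.numeric(x), length(x) %in% c(1, 2))
  stopifnot(all(x >= 0), all(x <= 1000), all(x == floor(x)))
  # A single value v1 applies to both sides of the boundary.
  if (length(x) == 1) x <- c(x, x)
  c(previous = x[1], following = x[2])
}

normalize_duplicate_rows_count(2)        # previous = 2, following = 2
normalize_duplicate_rows_count(c(1, 3))  # previous = 1, following = 3
```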
order.by.columns |
Optional Argument.
Specifies the ordering columns of input tbl_teradata. These columns
establish the order of the rows and splits. Without this argument,
the function can split the rows in any order.
Types: character OR vector of Strings (character)
|
split.count |
Optional Argument.
Specifies the desired number of splits in a partition of the output
tbl_teradata. The value of "split.count" must be a positive integer, and its
upper bound is the number of rows in the partition.
Note: If input tbl_teradata has multiple partitions, then you cannot specify
"split.count". Instead, specify "rows.per.split". Base the value of
"split.count" on the desired amount of parallelism.
For example, for a cluster with 10 vworkers, make "split.count" a multiple
of 10. If the number of rows in input tbl_teradata (n) is not exactly
divisible by "split.count", then the function estimates the number of
splits in the partition, using this formula:
ceiling (n / ceiling (n / split_count) )
Default Value: 4
Types: numeric
|
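The estimate formula given for "split.count" can be evaluated directly in R. The sketch below assumes that the inner term, ceiling(n / split_count), can be read as the number of rows per split; that reading and the helper name estimate_split_count are assumptions made for illustration.

```r
# Hypothetical helper (not part of tdplyr): evaluates the documented formula
# ceiling(n / ceiling(n / split_count)).
estimate_split_count <- function(n, split_count) {
  # split.count must be a positive integer no larger than the row count.
  stopifnot(n >= 1, split_count >= 1, split_count <= n)
  rows_per_split <- ceiling(n / split_count)  # assumed meaning of the inner term
  ceiling(n / rows_per_split)
}

estimate_split_count(10, 4)  # 4: splits of at most 3 rows
estimate_split_count(10, 6)  # 5: splits of at most 2 rows
```

In the second call the estimate (5) is lower than the requested "split.count" (6), which shows that the requested value is a target rather than a guarantee when n is not exactly divisible.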
rows.per.split |
Optional Argument.
Specifies the desired maximum number of rows in each split in the
output tbl_teradata. If the number of rows in input tbl_teradata is not exactly
divisible by "rows.per.split", then the last split contains fewer
than "rows.per.split" rows, but no split contains more than "rows.per.split"
rows. The value of "rows.per.split" must be a positive integer.
Note: If input tbl_teradata has multiple partitions, then specify
"rows.per.split" instead of "split.count".
Default Value: 1000
Types: numeric
|
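As a quick illustration of the "rows.per.split" rule, the split sizes for a single partition can be computed in plain R. The helper name split_sizes is hypothetical, and the sketch only restates the rule described above.

```r
# Hypothetical helper (not part of tdplyr): sizes of the splits produced
# for a partition of n rows when each split holds at most rows_per_split rows.
split_sizes <- function(n, rows_per_split) {
  stopifnot(n >= 0, rows_per_split >= 1)
  full <- n %/% rows_per_split       # number of completely filled splits
  remainder <- n %% rows_per_split   # rows left over for the last split
  c(rep(rows_per_split, full), if (remainder > 0) remainder)
}

split_sizes(2500, 1000)  # 1000 1000 500
```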
accumulate |
Optional Argument.
Specifies the names of input tbl_teradata columns (other than those
specified by "partition.columns" and "order.by.columns") to copy to the
output tbl_teradata. By default, only the columns specified by
"partition.columns" and "order.by.columns" are copied to the output
tbl_teradata.
Types: character OR vector of Strings (character)
|
split.id.column |
Optional Argument.
Specifies the name for the output tbl_teradata column to contain the split
identifiers. If the output tbl_teradata already has a column with the same
name as that specified in "split.id.column", the function returns an error.
Therefore, if the output tbl_teradata has a column named 'split_id' (specified
by "accumulate", "partition.columns", or "order.by.columns"), you must use
"split.id.column" to specify a different value.
Default Value: "split_id"
Types: character
|
return.stats.table |
Optional Argument.
Specifies whether the function returns the data in "stats.table". When
this value is FALSE, the function returns only the data in "output.table".
Default Value: TRUE
Types: logical
|
values.before.first |
Optional Argument.
If "duplicate.rows.count" is non-zero and "order.by.columns" is specified,
then "values.before.first" specifies the values to be stored in the
ordering columns that precede the first row of the first split in a
partition as a result of duplicating rows across split boundaries.
If "values.before.first" specifies only one value and "order.by.columns"
specifies multiple ordering columns, then the specified value is
stored in every ordering column.
If "values.before.first" specifies multiple values, then it must
specify a value for each ordering column. The value and the ordering
column must have the same data type. For the data type VARCHAR, the
values are case-insensitive. The default values for different data
types are:
Numeric: -1
CHAR(n) or VARCHAR : "-1"
Date- or time-based: 1900-01-01 0:00:00
CHARACTER: "0"
Bit: 0
Boolean: "false"
IP4 : 0.0.0.0
UUID: 0000-0000-0000-0000-0000-0000-0000-0000
Default Value: "-1"
Types: character OR vector of characters
|
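The rule for matching boundary values to ordering columns can be sketched in plain R. The helper name expand_boundary_values and the column name "seq_no" are hypothetical; the sketch covers only the one-value and one-value-per-column cases described above and does not check data types.

```r
# Hypothetical helper (not part of tdplyr): pairs each ordering column with
# the boundary value that would be stored in it.
expand_boundary_values <- function(values, order_by_columns) {
  # A single value is stored in every ordering column.
  if (length(values) == 1) {
    values <- rep(values, length(order_by_columns))
  }
  # Otherwise there must be exactly one value per ordering column.
  stopifnot(length(values) == length(order_by_columns))
  setNames(values, order_by_columns)
}

expand_boundary_values("-1", c("period", "seq_no"))
expand_boundary_values(c("1991-01-01", "0"), c("period", "seq_no"))
```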
values.after.last |
Optional Argument.
If "duplicate.rows.count" is non-zero and "order.by.columns" is specified,
then "values.after.last" specifies the values to be stored in the
ordering columns that follow the last row of the last split in a
partition as a result of duplicating rows across split boundaries.
If "values.after.last" specifies only one value and "order.by.columns"
specifies multiple ordering columns, then the specified value is
stored in every ordering column.
If "values.after.last" specifies multiple values, then it must specify
a value for each ordering column. The value and the ordering column must
have the same data type. For the data type VARCHAR, the values are
case-insensitive.
Default Value: NULL
Types: character OR vector of characters
|
duplicate.column |
Optional Argument.
Specifies the name of the column that indicates whether a row is
duplicated from the neighboring split. If the row is duplicated, this
column contains 1; otherwise it contains 0.
Types: character
|
partial.split.id |
Optional Argument.
Specifies whether "split.id.column" contains only the numeric split
identifier. If the value is TRUE, then "split.id.column" contains a
numeric representation of the split identifier that is unique for
each partition. To distribute the output tbl_teradata by split, use a
combination of all partition columns and "split.id.column". If the
value is FALSE, then "split.id.column" contains a string
representation of the split that is unique across all partitions. The
function generates the string representation by concatenating the
partitioning columns with the order of the split inside the partition
(the numeric representation). In the string representation, hyphens
separate partitioning column names from each other and from the
order. For example, "pcol1-pcol2-3".
Default Value: FALSE
Types: logical
|
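The two forms of the split identifier can be illustrated in plain R. The helper name make_split_id is hypothetical, and the sketch simply reproduces the hyphen-separated format from the "pcol1-pcol2-3" example above.

```r
# Hypothetical helper (not part of tdplyr): builds a split identifier from
# the partitioning column entries and the order of the split in its partition.
make_split_id <- function(partition_values, split_order, partial = FALSE) {
  # partial.split.id = TRUE: only the numeric order, unique within a partition.
  if (partial) return(split_order)
  # partial.split.id = FALSE: hyphen-separated string, unique across partitions.
  paste(c(partition_values, split_order), collapse = "-")
}

make_split_id(c("pcol1", "pcol2"), 3)                  # "pcol1-pcol2-3"
make_split_id(c("pcol1", "pcol2"), 3, partial = TRUE)  # 3
```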
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
Value
Function returns an object of class "td_series_splitter_mle" which is
a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using the following names:
output.table
stats.table
output
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("seriessplitter_example", "ibm_stock1")
# Create object(s) of class "tbl_teradata".
ibm_stock1 <- tbl(con, "ibm_stock1")
# Example 1: This example splits the time series stock data into subpartitions.
td_series_splitter_out <- td_series_splitter_mle(data = ibm_stock1,
                                                 partition.columns = c("id"),
                                                 order.by.columns = c("period"),
                                                 accumulate = c("stockprice")
                                                 )
# Example 2: Specifying the use of different arguments.
td_series_splitter_out1 <- td_series_splitter_mle(data = ibm_stock1,
                                                  partition.columns = 'id',
                                                  order.by.columns = 'period',
                                                  split.count = 9,
                                                  split.id.column = 'split_id',
                                                  duplicate.rows.count = c(1, 1),
                                                  return.stats.table = FALSE,
                                                  accumulate = 'stockprice',
                                                  values.after.last = NULL,
                                                  values.before.first = '1991-01-01',
                                                  partial.split.id = FALSE
                                                  )