
Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 16.20
created_date: February 2020
category: Programming Reference
featnum: B700-4007-098K

Description

The SeriesSplitter (td_series_splitter_mle) function splits partitions into subpartitions (called splits) to balance the partitions for time series manipulation. The function creates an additional column that contains split identifiers. Each row contains the identifier of the split to which the row belongs. Optionally, the function also copies a specified number of boundary rows to each split.

Usage

  td_series_splitter_mle (
      data = NULL,
      partition.columns = NULL,
      duplicate.rows.count = 1,
      order.by.columns = NULL,
      split.count = 4,
      rows.per.split = 1000,
      accumulate = NULL,
      split.id.column = "split_id",
      return.stats.table = TRUE,
      values.before.first = "-1",
      values.after.last = NULL,
      duplicate.column = NULL,
      partial.split.id = FALSE,
      data.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the input tbl_teradata to be split.
Types: tbl_teradata

partition.columns

Required Argument.
Specifies the partitioning columns of input table "data". These columns determine the identity of a partition. For data type restrictions of these columns, see the ML Engine Documentation.
Types: character OR vector of Strings (character)

duplicate.rows.count

Optional Argument.
Specifies the number of rows to duplicate across split boundaries. By default, the function duplicates one row from the previous partition and one row from the next partition. If you specify only one value v1, the function duplicates v1 rows from the previous partition and v1 rows from the next partition. If you specify two values v1 and v2, the function duplicates v1 rows from the previous partition and v2 rows from the next partition. Each value must be a non-negative integer less than or equal to 1000.
Default Value: 1
Types: numeric
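
For example, the two-value form of "duplicate.rows.count" might be used as follows (a sketch, not run here; it assumes a working context and the "ibm_stock1" example data used in the Examples section, and the counts 2 and 3 are illustrative):

    # Duplicate 2 rows from the previous split and 3 rows from the next split.
    con <- td_get_context()$connection
    loadExampleData("seriessplitter_example", "ibm_stock1")
    ibm_stock1 <- tbl(con, "ibm_stock1")
    td_out <- td_series_splitter_mle(data = ibm_stock1,
                                     partition.columns = "id",
                                     order.by.columns = "period",
                                     duplicate.rows.count = c(2, 3))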

order.by.columns

Optional Argument.
Specifies the ordering columns of input table "data". These columns establish the order of the rows and splits. Without this argument, the function can split the rows in any order.
Types: character OR vector of Strings (character)

split.count

Optional Argument.
Specifies the desired number of splits in a partition of the output table. The value of "split.count" must be a positive BIGINT, and its upper bound is the number of rows in the partition.
Note: If the input table has multiple partitions, you cannot specify "split.count"; specify "rows.per.split" instead. Base the value of "split.count" on the desired amount of parallelism.
For example, for a cluster with 10 vworkers, make "split.count" a multiple of 10. If the number of rows in the input table (n) is not exactly divisible by "split.count", the function estimates the number of splits in the partition with this formula: ceiling(n / ceiling(n / split.count))
Default Value: 4
Types: numeric
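
The estimate can be checked in plain R; the values of n and "split.count" below are illustrative:

    # A partition of 100 rows with split.count = 9: the function first sizes
    # each split at ceiling(100 / 9) = 12 rows, which yields
    # ceiling(100 / 12) = 9 splits.
    n <- 100
    split.count <- 9
    ceiling(n / ceiling(n / split.count))   # 9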

rows.per.split

Optional Argument.
Specifies the desired maximum number of rows in each split in the output table. If the number of rows in the input table is not exactly divisible by "rows.per.split", the last split contains fewer than "rows.per.split" rows, but no split contains more than "rows.per.split" rows. The value of "rows.per.split" must be a positive BIGINT.
Note: If the input table has multiple partitions, specify "rows.per.split" instead of "split.count".
Default Value: 1000
Types: numeric

accumulate

Optional Argument.
Specifies the names of input table columns (other than those specified by "partition.columns" and "order.by.columns") to copy to the output table. By default, only the columns specified by "partition.columns" and "order.by.columns" are copied to the output table.
Types: character OR vector of Strings (character)

split.id.column

Optional Argument.
Specifies the name for the output table column that contains the split identifiers. If the output table already has a column with this name, the function returns an error. Therefore, if the output table has a column named 'split_id' (specified by "accumulate", "partition.columns", or "order.by.columns"), you must use "split.id.column" to specify a different value.
Default Value: "split_id"
Types: character

return.stats.table

Optional Argument.
Specifies whether the function returns the data in "stats.table". When this value is FALSE, the function returns only the data in "output.table".
Default Value: TRUE
Types: logical

values.before.first

Optional Argument.
If "duplicate.rows.count" is non-zero and "order.by.columns" is specified, then "values.before.first" specifies the values to be stored in the ordering columns that precede the first row of the first split in a partition as a result of duplicating rows across split boundaries.
If "values.before.first" specifies only one value and "order.by.columns" specifies multiple ordering columns, then the specified value is stored in every ordering column.
If "values.before.first" specifies multiple values, then it must specify a value for each ordering column. The value and the ordering column must have the same data type. For the data type VARCHAR, the values are case-insensitive. The default values for different data types are:

  1. Numeric: -1

  2. CHAR(n) or VARCHAR: "-1"

  3. Date- or time-based: 1900-01-01 0:00:00

  4. CHARACTER: "0"

  5. Bit: 0

  6. Boolean: "false"

  7. IP4: 0.0.0.0

  8. UUID: 0000-0000-0000-0000-0000-0000-0000-0000

Default Value: "-1"
Types: character

values.after.last

Optional Argument.
If "duplicate.rows.count" is non-zero and "order.by.columns" is specified, then "values.after.last" specifies the values to be stored in the ordering columns that follow the last row of the last split in a partition as a result of duplicating rows across split boundaries.
If "values.after.last" specifies only one value and "order.by.columns" specifies multiple ordering columns, then the specified value is stored in every ordering column.
If "values.after.last" specifies multiple values, then it must specify a value for each ordering column. The value and the ordering column must have the same data type. For the data type VARCHAR, the values are case-insensitive.
Default Value: NULL
Types: character

duplicate.column

Optional Argument.
Specifies the name of the column that indicates whether a row is duplicated from the neighboring split. If the row is duplicated, this column contains 1; otherwise, it contains 0.
Types: character

partial.split.id

Optional Argument.
Specifies whether "split.id.column" contains only the numeric split identifier. If the value is TRUE, "split.id.column" contains a numeric representation of the split identifier that is unique within each partition. To distribute the output tbl_teradata by split, use a combination of all partitioning columns and "split.id.column". If the value is FALSE, "split.id.column" contains a string representation of the split that is unique across all partitions. The function generates the string representation by concatenating the partitioning columns with the order of the split inside the partition (the numeric representation). In the string representation, hyphens separate the partitioning column names from each other and from the order. For example, "pcol1-pcol2-3".
Default Value: FALSE
Types: logical
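
The string form can be mimicked in plain R; the partitioning column names "pcol1" and "pcol2" and the split order 3 are illustrative:

    # Hyphen-concatenation of partitioning column values and the split order.
    paste("pcol1", "pcol2", 3, sep = "-")   # "pcol1-pcol2-3"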

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions that produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_series_splitter_mle", which is a named list containing Teradata tbl objects. Named list members can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. stats.table

  3. output
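
For example (a sketch, assuming "td_series_splitter_out" holds the object returned by a prior td_series_splitter_mle call):

    # Reference the result tbls by name with the "$" operator.
    splits <- td_series_splitter_out$output.table
    stats  <- td_series_splitter_out$stats.table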

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("seriessplitter_example", "ibm_stock1")
    
    # Create remote tibble objects.
    ibm_stock1 <- tbl(con, "ibm_stock1")
    
    # Example 1 - This example splits the time series stock data into subpartitions.
    td_series_splitter_out <- td_series_splitter_mle(data = ibm_stock1,
                                                     partition.columns = c("id"),
                                                     order.by.columns = c("period"),
                                                     accumulate = c("stockprice")
                                                    )
    
    # Another example specifying the use of different arguments.
    td_series_splitter_out1 <- td_series_splitter_mle(data = ibm_stock1,
                                                      partition.columns = 'id',
                                                      order.by.columns = 'period',
                                                      split.count = 9,
                                                      split.id.column = 'split_id',
                                                      duplicate.rows.count = c(1,1),
                                                      return.stats.table = FALSE,
                                                      accumulate = 'stockprice',
                                                      values.after.last = NULL,
                                                      values.before.first = '1991-01-01',
                                                      partial.split.id = FALSE
                                                     )