Teradata R Package Function Reference | 17.00 - 17.00 - Histogram - Teradata R Package

Teradata® R Package Function Reference

prodname
Teradata R Package
vrm_release
17.00
created_date
September 2020
category
Programming Reference
featnum
B700-4007-090K

Description

Histograms are useful for assessing the shape of a data distribution. The Histogram function calculates the frequency distribution of a data set using sophisticated binning techniques that can automatically calculate the bin width and number of bins. The function maps each input row to one bin and returns the frequency (row count) and proportion (percentage of rows) of each bin.

Usage

  td_histogram_mle (
      data = NULL,
      auto.bin = NULL,
      custom.bin.table = NULL,
      custom.bin.column = NULL,
      bin.size = NULL,
      start.value = NULL,
      end.value = NULL,
      value.column = NULL,
      inclusion = "left",
      groupby.columns = NULL,
      data.sequence.column = NULL,
      custom.bin.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata containing the input data.

auto.bin

Optional Argument.
Specifies either the algorithm to be used for selecting bin boundaries or the approximate number of bins to be found. The permitted values are STURGES, SCOTT, or a positive integer which should be within quotes. If this argument is present, arguments "custom.bin.table", "custom.bin.column", "start.value", "bin.size", and "end.value" cannot be present.
Types: character

custom.bin.table

Optional Argument.
Specifies the tbl_teradata containing the boundary points between bins. If this argument is present, "custom.bin.column" must also be present, and arguments "auto.bin", "start.value", "bin.size", and "end.value" cannot be present.

custom.bin.column

Optional Argument.
Specifies the column in the "custom.bin.table" containing the boundary values. Input columns must contain numeric SQL types. If this argument is present, "custom.bin.table" must also be present; "auto.bin", "start.value", "bin.size", and "end.value" cannot be present.
Types: character

bin.size

Optional Argument.
Specifies the bin size. For equally sized bins, a double value specifying the width of the bin. Omit this argument if you are not using equally sized bins. The input value must be greater than 0.0. If this argument is present, "start.value" and "end.value" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
Types: numeric

start.value

Optional Argument.
Specifies the smallest value to be used in binning. If this argument is present, "bin.size" and "end.value" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
Types: numeric

end.value

Optional Argument.
Specifies the largest value to be used in binning. If this argument is present, "start.value" and "bin.size" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
Types: numeric

value.column

Required Argument.
Specifies the column in the input tbl_teradata for which statistics will be computed. Column must contain a numeric SQL types (integer, bigint, real, double precision, numeric, decimal, smallint).
Types: character

inclusion

Optional Argument.
Specifies whether points on bin boundaries should be included in the bin on the left or the bin on the right.
Default Value: "left"
Permitted Values: left, right
Types: character

groupby.columns

Optional Argument.
Specifies the columns in the input tbl_teradata used to group values for binning. These columns cannot contain floating point values.
Types: character OR vector of Strings (character)

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

custom.bin.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "custom.bin.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

Function returns an object of class "td_histogram_mle" which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. output

Examples

    library(ggplot2)
    
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("histogram_example", "cars_hist", "bin_breaks")

    # The cars_hist table has the cylinder (cyl) and horsepower (hp) data for 
    # different car models.
    cars_hist <- tbl(con, "cars_hist")
    # The bin_breaks table has the boundary values for the custom bins to be 
    # used while generating the histogram
    bin_breaks <- tbl(con, "bin_breaks")

    # Example 1: Generate histogram based on the cars horsepower using STURGES rule.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                         auto.bin = "Sturges",
                                         value.column = "hp"
                                        )
                                        
    # Plot showing the percentage of cars in each histogram bin
    ggplot(as.data.frame(td_histogram_out$output.table),
           aes(x=bin_end, y=bin_percent))
    geom_bar(stat = "identity", fill = "#FF6666")
    labs(x="Horsepower", y="Percentage")

    # Example 2: Generate histogram based on the cars horsepower by setting 
    # custom bin size, start and end values.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                         bin.size = 50,
                                         start.value = 20,
                                         end.value = 400,
                                         value.column = "hp",
                                         inclusion = "right",
                                         groupby.columns = c("cyl")
                                         )

    # Example 3: Generate histogram using custom bins from a custom tbl_teradata. 
    # Here cylinder (cyl) column is also used to group the input data.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                         custom.bin.table = bin_breaks,
                                         custom.bin.column = "break_values",
                                         value.column = "hp",
                                         groupby.columns = c("cyl")
                                         )