Teradata Package for R Function Reference | 17.00 - Histogram - Teradata Package for R - Look here for syntax, methods and examples for the functions included in the Teradata Package for R.

Teradata® Package for R Function Reference

Product
Teradata Package for R
Release Number
17.00
Published
July 2021
Language
English (United States)
Last Update
2023-08-08
dita:id
B700-4007
NMT
no
Product Category
Teradata Vantage
Histogram

Description

Histograms are useful for assessing the shape of a data distribution. The Histogram function calculates the frequency distribution of a data set using sophisticated binning techniques that can automatically calculate the bin width and number of bins. The function maps each input row to one bin and returns the frequency (row count) and proportion (percentage of rows) of each bin.

Usage

  td_histogram_mle (
      data = NULL,
      auto.bin = NULL,
      custom.bin.table = NULL,
      custom.bin.column = NULL,
      bin.size = NULL,
      start.value = NULL,
      end.value = NULL,
      value.column = NULL,
      inclusion = "left",
      groupby.columns = NULL,
      data.sequence.column = NULL,
      custom.bin.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata containing the input data.

auto.bin

Optional Argument.
Specifies either the algorithm to be used for selecting bin boundaries or the approximate number of bins to be found. The permitted values are STURGES, SCOTT, or a positive integer which should be within quotes. If this argument is present, arguments "custom.bin.table", "custom.bin.column", "start.value", "bin.size", and "end.value" cannot be present.
Types: character

custom.bin.table

Optional Argument.
Specifies the tbl_teradata containing the boundary points between bins. If this argument is present, "custom.bin.column" must also be present, and arguments "auto.bin", "start.value", "bin.size", and "end.value" cannot be present.

custom.bin.column

Optional Argument.
Specifies the column in the "custom.bin.table" containing the boundary values. Input columns must contain numeric SQL types. If this argument is present, "custom.bin.table" must also be present; "auto.bin", "start.value", "bin.size", and "end.value" cannot be present.
Types: character

bin.size

Optional Argument.
Specifies the bin size. For equally sized bins, a double value specifying the width of the bin. Omit this argument if you are not using equally sized bins. The input value must be greater than 0.0. If this argument is present, "start.value" and "end.value" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
Types: numeric

start.value

Optional Argument.
Specifies the smallest value to be used in binning. If this argument is present, "bin.size" and "end.value" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
Types: numeric

end.value

Optional Argument.
Specifies the largest value to be used in binning. If this argument is present, "start.value" and "bin.size" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
Types: numeric

value.column

Required Argument.
Specifies the column in the input tbl_teradata for which statistics will be computed. Column must contain a numeric SQL types (integer, bigint, real, double precision, numeric, decimal, smallint).
Types: character

inclusion

Optional Argument.
Specifies whether points on bin boundaries should be included in the bin on the left or the bin on the right.
Default Value: "left"
Permitted Values: left, right
Types: character

groupby.columns

Optional Argument.
Specifies the columns in the input tbl_teradata used to group values for binning. These columns cannot contain floating point values.
Types: character OR vector of Strings (character)

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

custom.bin.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "custom.bin.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

Function returns an object of class "td_histogram_mle" which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. output

Examples

  
    library(ggplot2)
    
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("histogram_example", "cars_hist", "bin_breaks")

    # The cars_hist table has the cylinder (cyl) and horsepower (hp) data for 
    # different car models.
    cars_hist <- tbl(con, "cars_hist")
    # The bin_breaks table has the boundary values for the custom bins to be 
    # used while generating the histogram
    bin_breaks <- tbl(con, "bin_breaks")

    # Example 1: Generate histogram based on the cars horsepower using STURGES rule.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                         auto.bin = "Sturges",
                                         value.column = "hp"
                                        )
                                        
    # Plot showing the percentage of cars in each histogram bin
    ggplot(as.data.frame(td_histogram_out$output.table),
           aes(x=bin_end, y=bin_percent))
    geom_bar(stat = "identity", fill = "#FF6666")
    labs(x="Horsepower", y="Percentage")

    # Example 2: Generate histogram based on the cars horsepower by setting 
    # custom bin size, start and end values.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                         bin.size = 50,
                                         start.value = 20,
                                         end.value = 400,
                                         value.column = "hp",
                                         inclusion = "right",
                                         groupby.columns = c("cyl")
                                         )

    # Example 3: Generate histogram using custom bins from a custom tbl_teradata. 
    # Here cylinder (cyl) column is also used to group the input data.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                         custom.bin.table = bin_breaks,
                                         custom.bin.column = "break_values",
                                         value.column = "hp",
                                         groupby.columns = c("cyl")
                                         )