Teradata R Package Function Reference - Histogram - Teradata R Package

Teradata® R Package Function Reference

Product
Teradata R Package
Release Number
16.20
Published
February 2020
Language
English (United States)
Last Update
2020-02-28
dita:id
B700-4007
lifecycle
previous
Product Category
Teradata Vantage

Description

Histograms are useful for assessing the shape of a data distribution. The Histogram function calculates the frequency distribution of a data set and can automatically determine the bin width and the number of bins. The function maps each input row to one bin and returns the frequency (row count) and proportion (percentage of rows) of each bin.
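As a rough local illustration of what "frequency" and "proportion" mean here, the following base R sketch bins a small made-up vector. It does not call the Teradata function; the data, the boundaries, and the column name `count` are assumptions chosen for the example (`bin_percent` mirrors the column used in the Examples section).

```r
# Local, base-R illustration only; not the Teradata implementation.
# Made-up horsepower-like values.
hp <- c(52, 65, 66, 91, 93, 95, 105, 110, 110, 123, 150, 175, 180, 245, 335)

# Hypothetical bin boundaries chosen for this example.
breaks <- c(50, 100, 150, 200, 250, 300, 350)

# Map each value to exactly one bin.
bins <- cut(hp, breaks = breaks, right = TRUE)

# Frequency (row count) and proportion (percentage of rows) per bin.
freq <- table(bins)
hist_tbl <- data.frame(
  bin         = names(freq),
  count       = as.vector(freq),
  bin_percent = 100 * as.vector(freq) / length(hp)
)
print(hist_tbl)
```

Every input value lands in exactly one bin, so the counts sum to the number of rows and the percentages sum to 100.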

Usage

  td_histogram_mle (
      data = NULL,
      auto.bin = NULL,
      custom.bin.table = NULL,
      custom.bin.column = NULL,
      bin.size = NULL,
      start.value = NULL,
      end.value = NULL,
      value.column = NULL,
      inclusion = "left",
      groupby.columns = NULL,
      data.sequence.column = NULL,
      custom.bin.table.sequence.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata containing the input data.

auto.bin

Optional Argument.
Specifies either the algorithm used to select bin boundaries or the approximate number of bins. Permitted values are STURGES, SCOTT, or a positive integer supplied as a quoted string. If this argument is present, arguments "custom.bin.table", "custom.bin.column", "start.value", "bin.size", and "end.value" cannot be present.
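For orientation, the textbook forms of the two named rules can be sketched in base R. This is a local approximation on synthetic data, assuming the usual definitions of the Sturges and Scott rules; the boundaries the function actually selects on the server may differ.

```r
# Textbook Sturges and Scott rules on synthetic data (illustration only).
set.seed(42)
x <- rnorm(200, mean = 150, sd = 60)  # synthetic "horsepower-like" values
n <- length(x)

# Sturges: number of bins = ceiling(log2(n) + 1)
k_sturges <- ceiling(log2(n) + 1)

# Scott: bin width = 3.49 * sd * n^(-1/3); bin count follows from the range
h_scott <- 3.49 * sd(x) * n^(-1/3)
k_scott <- ceiling(diff(range(x)) / h_scott)

cat("Sturges bins:", k_sturges, "\n")
cat("Scott bin width:", round(h_scott, 2), "bins:", k_scott, "\n")
```

Sturges depends only on the row count, whereas Scott also uses the spread of the data, so the two rules can suggest different bin counts for the same column.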

custom.bin.table

Optional Argument.
Specifies a tbl_teradata containing the boundary points between bins. If this argument is present, "custom.bin.column" must also be present, and arguments "auto.bin", "start.value", "bin.size", and "end.value" cannot be present.

custom.bin.column

Optional Argument.
Specifies the column in the "custom.bin.table" containing the boundary values. The column must contain a numeric SQL type. If this argument is present, "custom.bin.table" must also be present, and arguments "auto.bin", "start.value", "bin.size", and "end.value" cannot be present.

bin.size

Optional Argument.
For equally sized bins, a double value specifying the width of the bin. Omit this argument if you are not using equally sized bins. The input value must be greater than 0.0. If this argument is present, "start.value" and "end.value" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
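A minimal base R sketch of how "bin.size" combines with "start.value" and "end.value" to define equally sized bins. The values are hypothetical and chosen so that the range divides evenly; how the function treats a range that is not a multiple of the bin size is not covered here.

```r
# Hypothetical values; the range (400) is an exact multiple of the bin size.
start_value <- 0
end_value   <- 400
bin_size    <- 50

# Boundaries of the equally sized bins.
breaks <- seq(from = start_value, to = end_value, by = bin_size)

# Each adjacent pair of boundaries delimits one bin.
n_bins <- length(breaks) - 1

print(breaks)
print(n_bins)
```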

start.value

Optional Argument.
The smallest value to be used in binning. If this argument is present, "bin.size" and "end.value" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.

end.value

Optional Argument.
The largest value to be used in binning. If this argument is present, "start.value" and "bin.size" must also be present, and arguments "auto.bin", "custom.bin.table" and "custom.bin.column" cannot be present.
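The argument descriptions above define three mutually exclusive ways to specify bins. The rules can be summarized with a small checker. `binning_mode` is a hypothetical helper written for this page, not part of the Teradata R package, and returning `NA` when no binning argument is supplied is an assumption of the sketch, since that case is not described above.

```r
# Hypothetical helper, not part of the Teradata R package.
# Returns "auto", "custom", or "equal", or stops on an invalid combination.
binning_mode <- function(auto.bin = NULL, custom.bin.table = NULL,
                         custom.bin.column = NULL, bin.size = NULL,
                         start.value = NULL, end.value = NULL) {
  # TRUE for each argument that was supplied (is not NULL).
  given <- function(...) !vapply(list(...), is.null, logical(1))

  auto   <- given(auto.bin)
  custom <- given(custom.bin.table, custom.bin.column)
  equal  <- given(bin.size, start.value, end.value)

  # Arguments that belong to one mode must be supplied together.
  if (any(custom) && !all(custom)) {
    stop("custom.bin.table and custom.bin.column must be used together")
  }
  if (any(equal) && !all(equal)) {
    stop("bin.size, start.value and end.value must be used together")
  }
  if (all(equal) && bin.size <= 0) {
    stop("bin.size must be greater than 0")
  }

  # Only one mode may be active at a time.
  modes <- c(auto = any(auto), custom = any(custom), equal = any(equal))
  if (sum(modes) > 1) stop("binning modes cannot be combined")
  if (sum(modes) == 0) return(NA_character_)  # assumption: case not documented
  names(modes)[modes]
}

binning_mode(auto.bin = "Sturges")
binning_mode(bin.size = 50, start.value = 0, end.value = 400)
```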

value.column

Required Argument.
Specifies the column in the input tbl_teradata for which statistics will be computed. The column must contain a numeric SQL type (integer, bigint, real, double precision, numeric, decimal, smallint).

inclusion

Optional Argument.
Indicates whether points on bin boundaries should be included in the bin on the left or the bin on the right.
Default Value: "left"
Permitted Values: left, right
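The effect of boundary inclusion can be illustrated locally with base R's `cut()`. The mapping used here is one reading of the description above, not confirmed behavior: "left" is taken to mean that a boundary value falls into the bin to its left (intervals closed on the right), and "right" the opposite.

```r
breaks <- c(0, 50, 100, 150)
x <- c(50, 100)  # values that sit exactly on interior bin boundaries

# Boundary value joins the bin on its left: intervals are (a, b].
left_incl <- cut(x, breaks = breaks, right = TRUE)

# Boundary value joins the bin on its right: intervals are [a, b).
right_incl <- cut(x, breaks = breaks, right = FALSE)

as.character(left_incl)   # "(0,50]"   "(50,100]"
as.character(right_incl)  # "[50,100)" "[100,150)"
```

Only values that coincide with a boundary are affected; all other values land in the same bin under either setting.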

groupby.columns

Optional Argument.
Specifies the columns in the input tbl_teradata used to group values for binning. These columns cannot contain floating point values.
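Conceptually, grouping produces one set of bin counts per distinct group value. The base R sketch below shows the idea on made-up data with hypothetical boundaries; it is a local analogy, and the layout of the function's actual output is not shown here.

```r
# Made-up data: cylinder count (grouping column) and horsepower (value column).
cars <- data.frame(
  cyl = c(4, 4, 4, 6, 6, 8, 8, 8),
  hp  = c(52, 66, 93, 105, 110, 150, 175, 245)
)

# Hypothetical bin boundaries shared by all groups.
breaks <- c(50, 100, 150, 200, 250)
cars$bin <- cut(cars$hp, breaks = breaks, right = TRUE)

# One count per (group, bin) combination.
counts <- as.data.frame(table(cyl = cars$cyl, bin = cars$bin),
                        responseName = "count")

# Keep only the combinations that actually occur.
counts <- counts[counts$count > 0, ]
print(counts)
```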

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

custom.bin.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "custom.bin.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_histogram_mle", which is a named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator using the following names:

  1. output.table

  2. output

Examples

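    # tdplyr provides td_get_context(), loadExampleData() and td_histogram_mle();
    # dplyr provides tbl().
    library(dplyr)
    library(tdplyr)
    library(ggplot2)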
    
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("histogram_example", "cars_hist", "bin_breaks")  
    
    # The cars_hist table has the cylinder (cyl) and horsepower (hp) data for different car models.
    cars_hist <- tbl(con, "cars_hist")
    # The bin_breaks table has the boundary values for the custom bins to be used while generating the histogram
    bin_breaks <- tbl(con, "bin_breaks")
    
    # Example 1 - Generate histogram based on the cars horsepower using STURGES rule.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                     auto.bin = "Sturges",
                                     value.column = "hp"
                                     )
    # Plot showing the percentage of cars in each histogram bin
    ggplot(as.data.frame(td_histogram_out$output.table),
           aes(x = bin_end, y = bin_percent)) +
      geom_bar(stat = "identity", fill = "#FF6666") +
      labs(x = "Horsepower", y = "Percentage")
    
    # Example 2 - Generate histogram based on the cars horsepower by setting custom bin size, start and end values.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                     bin.size = 50,
                                     start.value = 20,
                                     end.value = 400,
                                     value.column = "hp",
                                     inclusion = "right",
                                     groupby.columns = c("cyl")
                                     )
    
    # Example 3 - Generate histogram using custom bins from a custom table. Here cylinder (cyl) column is also used to group the input data.
    td_histogram_out <- td_histogram_mle(data = cars_hist,
                                     custom.bin.table = bin_breaks,
                                     custom.bin.column = "break_values",
                                     value.column = "hp",
                                     groupby.columns = c("cyl")
                                     )