Description
Histogram analysis reveals the distribution of continuous numeric or date values
in a column. Histogram analysis is also referred to as binning because it counts the
occurrence of values in a series of numeric ranges called bins. The histogram
analysis provides a number of ways to define bins, allowing multidimensional binning,
overlaying of categorical data, and the calculation of numeric statistics within
bins. If you set the desired number of equal sized data bins, the desired number of
bins with a nearly equal number of values, a desired width, or the specific boundaries,
the Histogram analysis separates the data to show its distributional properties.
It does this by separating the data by bin number and gives counts and percentages
over the requested rows. Percentages always sum to 100
to specify a number of equal sized data bins in which the analysis determines the
minimum and maximum value, as well as a user-specified minimum and maximum value.
If the minimum and maximum are specified, all values less than the minimum are put in
bin 0, while all values greater than the maximum are put in bin N+1. The same is
true when the boundary option is specified.
The Histogram analysis optionally provides subtotals within each bin of the count,
percentage within the bin and percentage overall for each value or combination of
values of one or more overlaid columns. Another option is provided to collect simple
statistics for a binned column or another column of numeric or date type within the
table, providing the minimum, maximum, mean, and standard deviation. When statistics
are collected for a date type column, the standard deviation is given in units of days.
Histogram analysis can be performed on columns of numeric or date data type.
Usage
td_histogram_valib(data, columns, ...)
Arguments
data |
Required Argument. |
columns |
Required Argument.
Types: character OR vector of Strings (character) |
... |
Specifies other arguments supported by the function as described in the 'Other Arguments' section. |
Value
Function returns an object of class "td_histogram_valib"
which is a named list containing object of class "tbl_teradata".
Named list member can be referenced directly with the "$" operator
using name: result.
Other Arguments
bins
Optional Argument.
Specifies the number of equal width bins to create. If multiple columns are
requested, multiple bin sizes may be specified, such as bins=c(5, 10).
If fewer sizes are specified than columns, left-over columns are associated
with the default size of 10 bins.
Default Value: 10
Types: integer OR vector of Integers (int)
bins.with.boundaries
Optional Argument.
Specifies the number of bins spanning a range specified by
the minimum and maximum values.
For example,
bins.with.boundaries=c(5,0,200) creates 5 bins ranging
from 0 to 200.
If multiple columns are requested, multiple sets of parameters
must be specified, such as
bins.with.boundaries=c("10, 0, 200000", "5, 0, 100").
Note that multiple values are provided as string with numbers
enclosed in curly braces ''. Each such value corresponds
to the value in "columns" argument.
Types: integer, character OR vector of Integers (int) or Strings (character)
boundaries
Optional Argument.
Specifies the boundaries that define the bins.
For example,
boundaries=c(0,50,100,150) provides 3 bins between 0 and 150
(0 to 50, 50 to 100, and 100 to 150).
If multiple columns are requested, multiple sets of parameters must
be specified, such as
boundariesc("0, 50000, 100000, 150000", "0, 50, 100").
Note that multiple values are provided as string with numbers enclosed
in curly braces ''. Each such value corresponds to the value in
"columns" argument.
Types: integer, character OR vector of Integers (int) or Strings (character)
quantiles
Optional Argument.
Specifies the number of approximately equally populated bins to create.
If multiple columns are requested, multiple quantile sizes may be
specified, such as quantiles=c(5, 10). If fewer sizes are specified than
columns, left-over columns are associated with the default size of 10
quantiles.
Default Value: 10
Types: integer OR vector of Integers (int)
widths
Optional Argument.
Specifies the width of the bins to create. If multiple columns are
requested, multiple widths must be specified, such as widths=c(5, 10).
If fewer sizes are specified than columns, an error is raised.
Types: integer OR vector of Integers (int)
exclude.columns
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the
analysis, if a column specifier such as 'all', 'allnumeric',
'allnumericanddate' is used in the "columns" argument.
Types: character OR vector of Strings (character)
overlay.columns
Optional Argument.
Specifies a categorical variable with only a few values. If an
overlay column is specified, frequencies within each bin are
calculated for each value of that overlay column (frequencies for
crosstabs of values are given if more than one overlay column is
requested).
Note:
Use a specific column in either "overlay.columns" or "stats.columns", but not both.
Types: character OR vector of Strings (character)
stats.columns
Optional Argument.
Specifies a list of numeric columns/aliases for which simple
statistics are calculated (minimum, maximum, mean and standard
deviation) in each bin. This argument is not available for DATE
columns.
Note:
Use a specific column in either "overlay.columns" or "stats.columns", but not both.
Types: character OR vector of Strings (character)
hist.style
Optional Argument.
Specifies the histogram style to use for analysis.
Permitted Values:
'basic' - Creates a histogram for individual columns.
'crosstab' - Creates a multidimensional histogram by combining columns.
Default Value: 'basic'
Types: character
filter
Optional Argument.
Specifies the clause to filter rows selected for analysis within Histogram.
For example,
filter = "cust_id > 0"
Types: character
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set options 'val.install.location' to
# the database name where Vantage analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)
# Example 1: Perform Histogram analysis using default values.
obj <- td_histogram_valib(data=df,
columns=c("income", "age"))
# Print the results.
print(obj$result)
# Example 2: Perform Histogram analysis using "overlay.columns".
obj <- td_histogram_valib(data=df,
columns=c("income", "age"),
overlay.columns="gender")
# Print the results.
print(obj$result)
# Example 3: Perform Histogram analysis using "stats.columns".
obj <- td_histogram_valib(data=df,
columns=c("income", "age"),
stats.columns="years_with_bank")
# Print the results.
print(obj$result)
# Example 4: Perform Histogram analysis using 'crosstab' style, with filter.
obj <- td_histogram_valib(data=df,
columns=c("income", "age"),
hist.style="crosstab",
filter="income > 0")
# Print the results.
print(obj$result)
# Example 5: Perform Histogram analysis by specifying number of bins.
obj <- td_histogram_valib(data=df,
columns="income",
bins=5)
# Print the results.
print(obj$result)
# Example 6: Perform Histogram analysis by specifying bin width.
obj <- td_histogram_valib(data=df,
columns="income",
widths=25000)
# Print the results.
print(obj$result)
# Example 7: Perform Histogram analysis by specifying number of quantiles.
obj <- td_histogram_valib(data=df,
columns="income",
quantiles=4)
# Print the results.
print(obj$result)
# Example 8: Perform Histogram analysis by specifying boundaries on single column.
obj <- td_histogram_valib(data=df,
columns="income",
boundaries=c(0, 50000, 100000, 150000))
# Print the results.
print(obj$result)
# Example 9: Perform Histogram analysis by specifying boundaries on multiple columns.
obj <- td_histogram_valib(data=df,
columns=c("income", "age"),
boundaries=c("{0, 50000, 100000, 150000}", "{0,25,50,75}"))
# Print the results.
print(obj$result)
# Example 10: Perform Histogram analysis by specifying bins with boundaries
# on single column.
obj <- td_histogram_valib(data=df,
columns="income",
bins.with.boundaries=c(10, 0, 200000))
# Print the results.
print(obj$result)
# Example 11: Perform Histogram analysis by specifying bins with boundaries on
# multiple columns.
obj <- td_histogram_valib(data=df,
columns=c("income", "age"),
bins.with.boundaries=c("{10, 0, 200000}", "{5, 0, 75}"))
# Print the results.
print(obj$result)