Description
Histogram analysis reveals the distribution of continuous numeric or date values in a
column. Histogram analysis is also referred to as binning because it counts the
occurrence of values in a series of numeric ranges called bins. Adaptive Histogram
analysis supplements Histogram analysis by offering options to further subdivide the
distribution. You can specify the frequency percentage above which a single value
should be treated as a Spike, and a similar percentage above which a bin should be
subdivided. The Adaptive Histogram analysis modifies the computed equal-sized bins to
include a separate bin for each spike value, and to further subdivide a bin with an
excessive number of values, returning counts and boundaries for each resulting bin.
The subdivision of a bin with too many values is performed by first dividing the bin
by the same originally requested number of bins, and then merging this with a subdivision
in the region around the mean value within the bin, the region being the mean +/- the
standard deviation of the values in the bin. Subdividing can optionally be performed
using quantiles, giving approximately equally distributed sub-bins. (The quantiles
option is not recommended for tbl_teradata with extremely large numbers of
rows, such as input DataFrame with billions of rows.)
Adaptive binning can reduce the number of function calls needed to understand the
distribution of the values assumed by a column. Without adaptive binning, spike values
and/or overpopulated bins can distort the bin counts, as they are not separated or
subdivided unless the Adaptive Histogram function is used. It should be noted that
adaptive binning does not offer many of the specialized options that the standard
Histogram analysis does.
Usage
td_adaptive_histogram_valib(data, columns, ...)
Arguments
data |
Required Argument. |
columns |
Required Argument.
Types: character OR vector of Strings (character) |
... |
Specifies other arguments supported by the function as described in the 'Other Arguments' section. |
Value
Function returns an object of class "td_adaptive_histogram_valib" which is a named list containing object of class "tbl_teradata".cr Named list member can be referenced directly with the "$" operator using name: result.
Other Arguments
bins
Optional Argument.
Specifies the number of equal width bins to create. If multiple columns
are requested, multiple bin sizes may be specified, such as bins=c(5, 10).
If fewer sizes are specified than columns, leftover columns are associated
with the default size of 10 bins.
Default Value: 10
Types: integer OR list of Integers (int)
exclude.columns
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the
analysis, if a column specifier such as 'all', 'allnumeric',
'allnumericanddate' is used in the "columns" argument.
Types: character OR vector of Strings (character)
spike.threshold
Optional Argument.
Specifies a percentage of rows, expressed as an integer
from 1 to 100, above which an individual value of a column
is identified as a separate bin. Values with this or a larger
percentage of rows are identified as a Spike.
Default Value: 10 (10
Types: integer
subdivision.method
Optional Argument.
Specifies the option to subdivide the bins.
Permitted Values:
'means' - Subdivide bins with too many values using a range of +/- the standard deviation around the mean value in the bin.
'quantiles' - Subdivide bins with too many values using quantiles, giving approximately equally distributed bins.
Default Value: 'means'
Types: character
subdivision.threshold
Optional Argument.
Specifies a percentage of rows, expressed as an integer
from 0 to 100, above which a bin is subdivided into sub-bins.
Values with this or a larger percentage of rows are
subdivided into sub-bins.
Default Value: 30 (30
Types: integer
filter
Optional Argument.
Specifies the clause to filter rows selected for analysis within Adaptive
Histogram.
For example,
filter = "cust_id > 0"
Types: character
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set options 'val.install.location' to
# the database name where Vantage analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)
# Example 1: Shows execution with 10 equal width bins created by default for the
# 'income' column.
# The default value of 10 is used for "spike.threshold", and 30 for
# "subdivision.threshold", with means as the default "subdivision.method".
obj <- td_adaptive_histogram_valib(data=df, columns="income")
# Print the results.
print(obj$result)
# Example 2: Shows execution on two columns with different numbers of bins, 5 for
# 'income' and 3 for 'age'. Threshold values and other parameters assume
# default values.
obj <- td_adaptive_histogram_valib(data=df,
columns=c("income", "age"),
bins=c(5,3))
# Print the results.
print(obj$result)
# Example 3: Shows execution analysis done on two columns 'income' and 'age' with 5 bins
# for 'income' and default for 'age'. The non-default values of 12 is used for
# "spike.threshold", and 24 for "subdivision.threshold", with 'quantiles' as
# the default "subdivision.method".
obj <- td_adaptive_histogram_valib(data=df,
columns=c("income", "age"),
bins=5,
spike.threshold=12,
subdivision.method="quantiles",
subdivision.threshold=24,
filter="cust_id > 0")
# Print the results.
print(obj$result)