- Explore(data, columns=None, bins=10, bin_style='bins', max_comb_values=10000, max_unique_char_values=100, max_unique_num_values=20, min_comb_rows=25000, restrict_freq=True, restrict_threshold=1, statistical_method='population', stats_options=None, distinct=False, filter=None)
- DESCRIPTION:
The function performs basic statistical analysis on selected teradataml
DataFrame(s), or on selected columns from a teradataml DataFrame. It stores results
from four fundamental types of analysis, each based on a simplified version of the
Descriptive Statistics analysis:
1. Values
2. Statistics
3. Frequency
4. Histogram
Output teradataml DataFrames are produced for each type of analysis.
PARAMETERS:
data:
Required Argument.
Specifies the input data to perform basic statistical analysis.
Types: teradataml DataFrame
columns:
Optional Argument.
Specifies the name(s) of the column(s) to analyze.
Types: str OR list of Strings (str)
bins:
Optional Argument.
Specifies the number of equal-width bins to create for Histogram analysis.
Default Value: 10
Types: int
bin_style:
Optional Argument.
Specifies the bin style for Histogram analysis.
Permitted Values: 'bins', 'quantiles'
Default Value: 'bins'
Types: str
max_comb_values:
Optional Argument.
Specifies the maximum number of combined values for frequency or histogram analysis.
Default Value: 10000
Types: int
max_unique_char_values:
Optional Argument.
Specifies the maximum number of unique character values for unrestricted frequency
analysis.
Default Value: 100
Types: int
max_unique_num_values:
Optional Argument.
Specifies the maximum number of unique date or numeric values for frequency analysis.
Default Value: 20
Types: int
min_comb_rows:
Optional Argument.
Specifies the minimum number of rows required before frequency or histogram
combining is attempted.
Default Value: 25000
Types: int
restrict_freq:
Optional Argument.
Specifies whether to use restricted frequency processing, including prominent values.
Default Value: True
Types: bool
restrict_threshold:
Optional Argument.
Specifies the minimum percentage of rows a value must occur in, for inclusion in
results.
Default Value: 1
Types: int
statistical_method:
Optional Argument.
Specifies the method for calculating the statistics.
Permitted Values: 'population', 'sample'
Default Value: 'population'
Types: str
stats_options:
Optional Argument.
Specifies the basic statistics to be calculated for the Statistics analysis.
Permitted Values:
* all
* count (cnt)
* minimum (min)
* maximum (max)
* mean
* standarddeviation (std)
* skewness (skew)
* kurtosis (kurt)
* standarderror (ste)
* coefficientofvariance (cv)
* variance (var)
* sum
* uncorrectedsumofsquares (uss)
* correctedsumofsquares (css)
Types: str OR list of Strings (str)
distinct:
Optional Argument.
Specifies whether to count the unique values for each selected column. The count
is produced when this argument is set to True.
Default Value: False
Types: bool
filter:
Optional Argument.
Specifies the clause to filter rows selected for data exploration.
For example,
filter = "cust_id > 0"
Types: str
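The sketch below illustrates, in plain Python, the ideas behind three of the arguments above: equal-width bins ("bins"), population versus sample statistics ("statistical_method"), and a minimum percentage cutoff for frequency results ("restrict_threshold"). It does not call teradataml and is not the library's implementation. The helper names are hypothetical, and the inclusive ">=" comparison for the threshold is an assumption.

```python
import statistics
from collections import Counter


def equal_width_edges(values, bins=10):
    """Return bins + 1 edges that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def std_dev(values, statistical_method="population"):
    """Population divides by n; sample divides by n - 1."""
    if statistical_method == "population":
        return statistics.pstdev(values)
    return statistics.stdev(values)


def restricted_frequency(values, restrict_threshold=1):
    """Keep values occurring in at least restrict_threshold percent of rows.

    Assumption: the comparison is inclusive (>=).
    """
    counts = Counter(values)
    total = len(values)
    return {
        value: count
        for value, count in counts.items()
        if 100.0 * count / total >= restrict_threshold
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]
edges = equal_width_edges([0, 10], bins=5)
population_std = std_dev(data, "population")
sample_std = std_dev(data, "sample")
frequent = restricted_frequency(["a"] * 98 + ["b"] * 2, restrict_threshold=5)
```

For the same data, the sample standard deviation is always larger than the population one, because the sum of squared deviations is divided by n - 1 rather than n.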
RETURNS:
An instance of Explore.
Output teradataml DataFrames can be accessed using attribute references, such as
ExploreObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. frequency_output
2. histogram_output
3. statistics_output
4. values_output
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# Analytic Library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import valib object from teradataml to execute this function.
from teradataml import valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create required teradataml DataFrame.
from teradataml import DataFrame
df = DataFrame("customer")
print(df)
# Example 1: Shows data exploration with default values.
obj = valib.Explore(data=df)
# Print the frequency results.
print(obj.frequency_output)
# Print the histogram results.
print(obj.histogram_output)
# Print the statistics results.
print(obj.statistics_output)
# Print the values results.
print(obj.values_output)
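As a further illustration, the following sketch collects a few non-default arguments in a dictionary and shows, in comments, how they could be passed to the function. Only argument names documented above are used. The column names "income" and "age" are assumptions about the "customer" table, and the long-form statistic names are taken from the permitted values listed for stats_options. The call itself is left commented out because it needs an active Vantage connection.

```python
# Non-default arguments for Explore, keyed by the documented argument names.
explore_args = {
    "columns": ["income", "age"],  # hypothetical column names
    "bins": 5,                     # five bins instead of the default 10
    "statistical_method": "sample",
    "stats_options": ["minimum", "maximum", "mean"],
    "filter": "cust_id > 0",       # filter clause from the argument description
}

# With a Vantage connection and the "df" DataFrame created above:
# obj = valib.Explore(data=df, **explore_args)
# print(obj.statistics_output)
# print(obj.histogram_output)
```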