- Explore(data, columns=None, bins=10, bin_style='bins', max_comb_values=10000, max_unique_char_values=100, max_unique_num_values=20, min_comb_rows=25000, restrict_freq=True, restrict_threshold=1, statistical_method='population', stats_options=None, distinct=False, filter=None, gen_sql=False)
DESCRIPTION:
The function performs basic statistical analysis on a teradataml DataFrame,
or on selected columns of a teradataml DataFrame. It stores the results
of four fundamental types of analysis, each based on a simplified version of the
Descriptive Statistics analysis:
1. Values
2. Statistics
3. Frequency
4. Histogram
Output teradataml DataFrames are produced for each type of analysis.
PARAMETERS:
data:
Required Argument.
Specifies the input data on which to perform basic statistical analysis.
Types: teradataml DataFrame
columns:
Optional Argument.
Specifies the name(s) of the column(s) to analyze.
Types: str OR list of Strings (str)
bins:
Optional Argument.
Specifies the number of equal-width bins to create for Histogram analysis.
Default Value: 10
Types: int
bin_style:
Optional Argument.
Specifies the bin style for Histogram analysis.
Permitted Values: 'bins', 'quantiles'
Default Value: 'bins'
Types: str
max_comb_values:
Optional Argument.
Specifies the maximum number of combined values for frequency or histogram analysis.
Default Value: 10000
Types: int
max_unique_char_values:
Optional Argument.
Specifies the maximum number of unique character values for unrestricted frequency
analysis.
Default Value: 100
Types: int
max_unique_num_values:
Optional Argument.
Specifies the maximum number of unique date or numeric values for frequency analysis.
Default Value: 20
Types: int
min_comb_rows:
Optional Argument.
Specifies the minimum number of rows required before frequency or histogram combining is attempted.
Default Value: 25000
Types: int
restrict_freq:
Optional Argument.
Specifies whether to restrict frequency processing to prominent values.
Default Value: True
Types: bool
restrict_threshold:
Optional Argument.
Specifies the minimum percentage of rows a value must occur in, for inclusion in
results.
Default Value: 1
Types: int
statistical_method:
Optional Argument.
Specifies the method for calculating the statistics.
Permitted Values: 'population', 'sample'
Default Value: 'population'
Types: str
stats_options:
Optional Argument.
Specifies the basic statistics to be calculated for the Statistics analysis.
Permitted Values:
* all
* count (cnt)
* minimum (min)
* maximum (max)
* mean
* standarddeviation (std)
* skewness (skew)
* kurtosis (kurt)
* standarderror (ste)
* coefficientofvariance (cv)
* variance (var)
* sum
* uncorrectedsumofsquares (uss)
* correctedsumofsquares (css)
Types: str OR list of Strings (str)
distinct:
Optional Argument.
Specifies whether to count the unique values of each selected column. The count is
computed when this argument is set to True.
Default Value: False
Types: bool
filter:
Optional Argument.
Specifies the clause to filter rows selected for data exploration.
For example,
filter = "cust_id > 0"
Types: str
gen_sql:
Optional Argument.
Specifies whether to store and return the SQL generated for the function.
When set to True, the function SQL is generated as well as executed, and can be accessed
using the show_query() method; otherwise, the SQL is executed but not returned.
Default Value: False
Types: bool
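To make the statistical_method and bin_style options above more concrete, here is a
minimal sketch in plain Python. It is only an illustration of the general concepts
(population vs. sample statistics, equal-width vs. quantile-based bin edges) and is
not the Vantage Analytic Library implementation. The helper names equal_width_edges
and quantile_edges and the sample values are hypothetical, and the assumption that
'quantiles' yields bins holding a similar share of rows is ours.

```python
import statistics

# Hypothetical sample values, used only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# 'population' divides the sum of squared deviations by n,
# 'sample' divides it by n - 1, so the sample figure is slightly larger.
population_std = statistics.pstdev(values)
sample_std = statistics.stdev(values)


def equal_width_edges(data, bins):
    """Return bins + 1 edges that split [min, max] into equal-width intervals."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def quantile_edges(data, bins):
    """Return bins + 1 edges placed at quantiles of the data (assumed meaning)."""
    cuts = statistics.quantiles(data, n=bins, method="inclusive")
    return [min(data)] + cuts + [max(data)]


print("population std:", population_std)
print("sample std:", sample_std)
print("equal-width edges:", equal_width_edges(values, 4))
print("quantile edges:", quantile_edges(values, 4))
```

Equal-width edges depend only on the minimum and maximum, while quantile edges follow
the distribution of the data, so skewed columns tend to produce very different
histograms under the two styles.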
RETURNS:
An instance of Explore.
Output teradataml DataFrames can be accessed using attribute references, such as
ExploreObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. frequency_output
2. histogram_output
3. statistics_output
4. values_output
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import valib object from teradataml to execute this function.
from teradataml import valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create required teradataml DataFrame.
from teradataml import DataFrame
df = DataFrame("customer")
print(df)
# Example 1: Shows data exploration with default values.
obj = valib.Explore(data=df)
# Print the frequency results.
print(obj.frequency_output)
# Print the histogram results.
print(obj.histogram_output)
# Print the statistics results.
print(obj.statistics_output)
# Print the values results.
print(obj.values_output)
# Example 2: Generate SQL for the function and execute the same.
obj = valib.Explore(data=df, gen_sql=True)
# Print the generated SQL.
print(obj.show_query("sql"))
# Print both generated SQL and stored procedure call.
print(obj.show_query("both"))
# Print the stored procedure call.
print(obj.show_query())
print(obj.show_query("sp"))
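The examples above use default values only. As a further sketch, the optional
arguments described under PARAMETERS could be combined as shown below. The column
names "income" and "age" are hypothetical and assumed to exist in the "customer"
table; the stats_options values use the long names from the permitted list, and the
filter clause is the one from the parameter description. The call is wrapped in a
hypothetical helper, explore_selected_columns, so that defining it does not require
a Vantage connection.

```python
def explore_selected_columns(df):
    """Run Explore on selected columns of an existing teradataml DataFrame."""
    # Imported inside the function so the definition alone needs no connection.
    from teradataml import valib

    return valib.Explore(
        data=df,
        columns=["income", "age"],  # hypothetical column names
        bins=5,                     # fewer histogram bins than the default of 10
        stats_options=["count", "minimum", "maximum", "mean", "standarddeviation"],
        statistical_method="sample",
        filter="cust_id > 0",       # clause from the 'filter' description
    )
```

With a connection in place and df created as above, this would be used as
obj = explore_selected_columns(df), followed by print(obj.statistics_output) or
any of the other output attributes listed under RETURNS.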