| |
- TextAnalyzer(data, columns, exclude_columns=None, analyze_numerics=True, analyze_unicode=False, gen_sql_only=False)
- DESCRIPTION:
When working with character data it is useful to determine the data type and what
data can be stored in the database. The TextAnalyzer function analyzes character
data and distinguishes if the field is a numeric type, date, time, timestamp, or
character data and returns two output DataFrames, one containing the analysis
results and the second one containing the column data type matrix having the
progression of data type through the series of steps (mentioned below).
The function runs a series of tests to distinguish what the correct underlying
type of each selected column should be. The first test performed on the column is
the MIN and the MAX test. The MIN and MAX values of a column are retrieved from
the database and tested to determine what type the values are. The next test is a
sample test which retrieves a small sample of data for each column and again
assesses what type they should be. The next test, extended Numeric analysis, is
for fields that were determined to be numeric and it tries to classify them in a
more specific category if possible. For instance, a column that is considered a
FLOAT type after the first two tests might really be a DECIMAL type with 2 decimal
places. In the next test, a date type is validated to make sure all values in that
column are truly dates. Finally, if requested, unicode columns are tested to see
if they contain only Latin characters. This test is called extended Unicode analysis.
The function can be used on columns of any character data type. Columns of
non-character data type are passed along to the output as defined in the input data.
Note:
This function is available in Vantage Analytic Library 2.0.0.2.
PARAMETERS:
data:
Required Argument.
Specifies the input data to perform text analysis.
Types: teradataml DataFrame
columns:
Required Argument.
Specifies the name(s) of the column(s) to analyze. Occasionally, it can also
accept permitted strings to specify all columns, or all character columns.
Permitted Values:
* Name(s) of the column(s) in "data".
* Pre-defined strings:
* 'all' - all columns
* 'allcharacter' - all character columns
Types: str OR list of Strings (str)
exclude_columns:
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the analysis, if a
column specifier such as 'all', 'allcharacter' is used in the "columns" argument.
Types: str OR list of Strings (str)
analyze_numerics:
Optional Argument.
Specifies whether to process specific numeric types. If True, the function
processes numeric types.
Default Value: True
Types: bool
analyze_unicode:
Optional Argument.
Specifies whether a column declared to contain Unicode characters actually
contains only Latin characters. If True, Unicode analysis is performed.
Default Value: False
Types: bool
gen_sql_only:
Optional Argument.
Specifies whether to generate only SQL for the function.
When set to True, function SQL is generated, not executed, which can be accessed
using show_query() method, otherwise SQL is just executed but not returned.
Default Value: False
Types: bool
RETURNS:
An instance of TextAnalyzer.
Output teradataml DataFrames can be accessed using attribute references, such as
TextAnalyzerObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1.result
2.data_type_matrix
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# analytic library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import valib object from teradataml to execute this function.
from teradataml import valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create required teradataml DataFrame.
df = DataFrame("customer")
print(df)
# Example 1: Perform text analysis on all columns with only required arguments
# and default values for "analyze_numerics" and "analyze_unicode" arguments.
obj = valib.TextAnalyzer(data=df,
columns="all")
# Print the results.
print(obj.result)
print(obj.data_type_matrix)
# Example 2: Perform text analysis, including numeric and unicode analysis, on
# columns 'cust_id', 'gender' and 'marital_status'.
obj = valib.TextAnalyzer(data=df,
columns=["cust_id", "gender", "marital_status"],
analyze_numerics=True,
analyze_unicode=True)
# Print the results.
print(obj.result)
print(obj.data_type_matrix)
# Example 6: Generate only SQL for the function, but do not execute the same.
obj = valib.TextAnalyzer(data=df,
columns="all",
gen_sql_only=True)
# Print the generated SQL.
print(obj.show_query("sql"))
# Print both generated SQL and stored procedure call.
print(obj.show_query("both"))
# Print the stored procedure call.
print(obj.show_query())
print(obj.show_query("sp"))
|