Description
When working with character data it is useful to determine the data type
and what data can be stored in the database. The td_text_analyzer()
analyzes character data and distinguishes if the field is a numeric type,
date, time, timestamp, or character data and returns two output results,
one containing the analysis results and the second one containing the column
data type matrix having the progression of data type through the series of
steps (mentioned below).
The function runs a series of tests to distinguish what the correct
underlying type of each selected column should be.
The first test performed on the column is the MIN and the MAX test. The MIN and MAX values of a column are retrieved from the database and tested to determine what type the values are.
The next test is a sample test which retrieves a small sample of data for each column and again assesses what type they should be.
The next test, extended Numeric analysis, is for fields that were determined to be numeric and it tries to classify them in a more specific category if possible. For instance, a column that is considered a FLOAT type after the first two tests might really be a DECIMAL type with 2 decimal places.
In the next test, a date type is validated to make sure all values in that column are truly dates.
If requested, unicode columns are tested to see if they contain only Latin characters. This test is called extended Unicode analysis.
The function can be used on columns of any character data type. Columns of
non-character data type are passed along to the output as defined in the
input data.
Note:
This function is available in Vantage Analytic Library 2.0.0.2 or later.
Usage
td_text_analyzer_valib(data, columns, ...)
Arguments
data |
Required Argument. |
columns |
Required Argument.
Types: character OR vector of Strings (character) |
... |
Specifies other arguments supported by the function as described in the 'Other Arguments' section. |
Value
Function returns an object of class "td_text_analyzer_valib"
which is a named list containing object of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using name:
result
data.type.matrix
Other Arguments
exclude.columns
Optional Argument.
Specifies the name(s) of the column(s) to exclude
from the analysis, if a column specifier such as
'all', 'allcharacter' is used in the "columns"
argument.
Types: character OR vector of Strings (character)
analyze.numerics
Optional Argument.
Specifies whether to process specific numeric
types. If TRUE, the function processes numeric
types.
Default Value: TRUE
Types: logical
analyze.unicode
Optional Argument.
Specifies whether a column declared to contain
Unicode characters actually contains only Latin
characters. If TRUE, Unicode analysis is
performed.
Default Value: FALSE
Types: logical
Examples
# Notes:
# 1. To execute Vantage Analytic Library functions, set option
# 'val.install.location' to the database name where Vantage analytic
# library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic
# Library installer.
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Get remote data source connection.
con <- td_get_context()$connection
# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)
# Example 1: Perform text analysis on all columns with only required
# arguments and default values for "analyze.numerics" and
# "analyze.unicode" arguments.
obj <- td_text_analyzer_valib(data=df,
columns="all")
# Print the results.
print(obj$result)
print(obj$data.type.matrix)
# Example 2: Perform text analysis, including numeric and unicode analysis,
# on columns 'cust_id', 'gender' and 'marital_status'.
obj <- td_text_analyzer_valib(data=df,
columns=c("cust_id","gender","marital_status"),
analyze.numerics=TRUE,
analyze.unicode=TRUE)
# Print the results.
print(obj$result)
print(obj$data.type.matrix)