Teradata Package for R Function Reference | 17.00 - 17.00 - td_text_analyzer_valib - Teradata Package for R

Teradata® Package for R Function Reference

Product
Teradata Package for R
Release Number
17.00
Release Date
July 2021
Content Type
Programming Reference
Publication ID
B700-4007-090K
Language
English (United States)

Description

When working with character data it is useful to determine the data type and what data can be stored in the database. The td_text_analyzer() analyzes character data and distinguishes if the field is a numeric type, date, time, timestamp, or character data and returns two output results, one containing the analysis results and the second one containing the column data type matrix having the progression of data type through the series of steps (mentioned below).

The function runs a series of tests to distinguish what the correct underlying type of each selected column should be.

  1. The first test performed on the column is the MIN and the MAX test. The MIN and MAX values of a column are retrieved from the database and tested to determine what type the values are.

  2. The next test is a sample test which retrieves a small sample of data for each column and again assesses what type they should be.

  3. The next test, extended Numeric analysis, is for fields that were determined to be numeric and it tries to classify them in a more specific category if possible. For instance, a column that is considered a FLOAT type after the first two tests might really be a DECIMAL type with 2 decimal places.

  4. In the next test, a date type is validated to make sure all values in that column are truly dates.

  5. If requested, unicode columns are tested to see if they contain only Latin characters. This test is called extended Unicode analysis.


The function can be used on columns of any character data type. Columns of non-character data type are passed along to the output as defined in the input data.
Note:

  • This function is available in Vantage Analytic Library 2.0.0.2 or later.

Usage

td_text_analyzer_valib(data, columns, ...)

Arguments

data

Required Argument.
Specifies the input data to perform text analysis.
Types: tbl_teradata

columns

Required Argument.
Specifies the name(s) of the column(s) to analyze. Occasionally, it can also accept permitted strings to specify all columns, or all character columns.
Permitted Values:

  1. Name(s) of the column(s) in "data".

  2. Pre-defined strings:

    1. 'all' - all columns

    2. 'allcharacter' - all character columns

Types: character OR vector of Strings (character)

...

Specifies other arguments supported by the function as described in the 'Other Arguments' section.

Value

Function returns an object of class "td_text_analyzer_valib" which is a named list containing object of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using name:

  1. result

  2. data.type.matrix

Other Arguments

exclude.columns

Optional Argument.
Specifies the name(s) of the column(s) to exclude from the analysis, if a column specifier such as 'all', 'allcharacter' is used in the "columns" argument.
Types: character OR vector of Strings (character)

analyze.numerics

Optional Argument.
Specifies whether to process specific numeric types. If TRUE, the function processes numeric types.
Default Value: TRUE
Types: logical

analyze.unicode

Optional Argument.
Specifies whether a column declared to contain Unicode characters actually contains only Latin characters. If TRUE, Unicode analysis is performed.
Default Value: FALSE
Types: logical

Examples

# Notes:
#   1. To execute Vantage Analytic Library functions, set option 
#      'val.install.location' to the database name where Vantage analytic 
#      library functions are installed.
#   2. Datasets used in these examples can be loaded using Vantage Analytic 
#      Library installer.

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Get remote data source connection.
con <- td_get_context()$connection

# Create an object of class "tbl_teradata".
df <- tbl(con, "customer")
print(df)

# Example 1: Perform text analysis on all columns with only required 
#            arguments and default values for "analyze.numerics" and 
#            "analyze.unicode" arguments.
obj <- td_text_analyzer_valib(data=df,
                              columns="all")

# Print the results.
print(obj$result)
print(obj$data.type.matrix)

# Example 2: Perform text analysis, including numeric and unicode analysis, 
#            on columns 'cust_id', 'gender' and 'marital_status'.
obj <- td_text_analyzer_valib(data=df,
                              columns=c("cust_id","gender","marital_status"),
                              analyze.numerics=TRUE,
                              analyze.unicode=TRUE)

# Print the results.
print(obj$result)
print(obj$data.type.matrix)