Text Field Analysis - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

Product
Teradata Warehouse Miner
Release Number
5.4.5
Published
February 2018
Language
English (United States)
Last Update
2018-05-04
dita:mapPath
dfw1503087325991.ditamap
dita:ditavalPath
ft:empty
dita:id
B035-2306
lifecycle
previous
Product Category
Teradata® Warehouse Miner

Purpose

When dealing with character data it is helpful to be able to examine this data and determine what actual data type the data could be stored in within the database. The Text Field Analyzer can analyze character data and help distinguish whether the field is a numeric type, a date, a time, a timestamp, or character data. Text Field Analyzer can be applied to any type of character data. Non-character data types go unprocessed and are passed along to the output just as they are defined in the input table.

Given a table name and the name of one or more columns, the Text Field Analyzer provides a series of tests to distinguish what the correct underlying type of each selected column should be. The first test performed on the column is the MIN and the MAX test. The MIN and MAX values of a column are retrieved from the database and tested to determine what type the values are. The next test is a sample test which retrieves a small sample of data for each column and again assesses what type they should be. The next test is for fields that were determined to be numeric and it tries to classify them in a more specific category if possible. For instance, a column that is considered a FLOAT type after the first two tests might really be a DECIMAL type with 2 decimal places (the extended numeric test is optional but by default is performed). In the next test, a date type is validated to make sure all values in that column are truly dates. Finally, if requested, Unicode columns are tested to see if they contain only Latin characters. The extended Unicode analysis is optional and by default is not performed.

A Text Field Analyzer can be performed on columns of any character data type. Columns of non-character data type are passed along to the output just as they are defined in the input table.

Syntax

call twm. td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis;Optional Parameters;');

Required Parameters

columns
The columns to analyze. For example, columns=c1,c2,c3. Enter columns=all is entered, all columns in the input table are analyzed.
database
The database containing the table to analyze.
outputdatabase
The Text Field Analyzer generates SQL to create a table with columns of proposed data types. The outputdatabase parameter is used in this generated SQL to fully qualify the name of the output table with the output database. The SQL is not executed by the analysis.
outputtablename
The Text Field Analyzer generates SQL to create a table with columns of proposed data types. The outputtablename parameter is the name of the output table. The SQL is not executed by the analysis.
tablename
The table containing the columns to analyze.
TextFieldAnalyzer
The TextFieldAnalzer parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes

Optional Parameters

extendednumericanalysis
Determines whether to process specific numeric types. By default extendednumericanalysis=true to process extends numeric analysis.
extendedunicodeanalysis
Determines if a column declared to contain Unicode characters actually contains only Latin characters is enabled by setting the extendedunicodeanalysis parameter to true. By default, extended Unicode analysis is not performed. This is equivalent to extendedunicodeanalysis=false.
overwrite

When overwrite is set to true (default), the output tables are dropped before creating new ones.

Examples

These examples demonstrate the invocation of the Text Field Analyzer. To execute the provided examples, the td_analyze function must be installed in a database called twm and the Teradata Warehouse Miner tutorial data must be installed in the twm_source database.

This example includes only required parameters and relies on default values to request extended numeric analysis and not to request extended Unicode analysis.

call twm.td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis;');

This example is functionally equivalent to the first example but explicitly sets the options for extended numeric analysis and extended Unicode analysis to their default values (true and false).

call twm.td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=cust_id,gender,marital_status;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis2;extendednumericanalysis=true;extendedunicodeanalysis=false;');

This example requests extended Unicode analysis without extended numeric analysis.

call twm.td_analyze('TextFieldAnalyzer','database=twm;tablename=TWMX_ADSParameters;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis3;extendednumericanalysis=false;extendedunicodeanalysis=true;');