5.4.2 - Text Field Analysis - Teradata Warehouse Miner

In-Database Analytic Functions User Guide

prodname
Teradata Warehouse Miner
vrm_release
5.4.2
created_date
October 2016
category
User Guide
featnum
B035-2306-106K

Purpose

When dealing with character data it is sometimes helpful to be able to examine this data and determine what actual data type the data could be stored in within the database. The Text Field Analyzer analysis can analyze character data and help distinguish whether the field is a numeric type, a date, a time, a timestamp, or character data. Text field analysis can readily be applied to any type of character data. Non-character data types go unprocessed and are passed along to the output just as they are defined in the input table.

Given a table name and the name of one or more columns, the Text Field Analyzer provides a series of tests to distinguish what the correct underlying type of each selected column should be. The first test performed on the column is the MIN and the MAX test. The MIN and MAX values of a column are retrieved from the database and tested to determine what type the values are. The next test is a sample test which retrieves a small sample of data for each column and again assesses what type they should be. The next test is a test for fields that have already been determined to be numeric and it tries to classify them in a more specific category if possible. For instance, a column that is considered a FLOAT type after the first two tests might really be a DECIMAL type with 2 decimal places (the extended numeric test is optional but by default is performed.) In the next test, a date type will be validated to make sure all values in that column are truly dates. Finally, if requested, Unicode columns are tested to see if they contain only Latin characters. The extended Unicode analysis is optional and by default is not performed.

A Text Field Analyzer analysis can be performed on columns of any character data type. Columns of non-character data type are passed along to the output just as they are defined in the input table.

Syntax

call twm. td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis;Optional Parameters;');

Required Parameters

columns
The columns to be analyzed. For example, columns=c1,c2,c3. If columns=all is entered, all columns in the input table are analyzed.
database
The database containing the table to be analyzed.
outputdatabase
The Text Field Analyzer generates SQL to create a table with columns of proposed data types. The outputdatabase parameter is used in this generated SQL to fully qualify the name of the output table with the output database. The SQL is not executed by the analysis.
outputtablename
The Text Field Analyzer generates SQL to create a table with columns of proposed data types. The outputtablename parameter is used in this generated SQL as the name of the output table. The SQL is not executed by the analysis.
tablename
The table containing the columns to be analyzed.
TextFieldAnalyzer
The TextFieldAnalzer parameter:
  • Is required
  • Must be the first parameter
  • Is always enclosed in single quotes

Optional Parameters

extendednumericanalysis
By setting the extendednumericanalysis parameter to false, the processing to determine specific numeric types is inhibited. By default, extended numeric analysis is performed. This is equivalent to extendednumericanalysis=true.
extendedunicodeanalysis
By setting the extendedunicodeanalysis parameter to true, the processing to determine if a column declared to contain Unicode characters actually contains only Latin characters is enabled. By default, extended Unicode analysis is not performed. This is equivalent to extendedunicodeanalysis=false.

Examples

The examples in this section demonstrate the invocation of the Text Field Analyzer analysis. These examples assume that the td_analyze function has been installed in a database named twm.

This example includes only required parameters and relies on default values to request extended numeric analysis and not to request extended Unicode analysis.

call twm.td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis;');

This example is functionally equivalent to the first example but explicitly sets the options for extended numeric analysis and extended Unicode analysis to their default values (true and false respectively).

call twm.td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=cust_id,gender,marital_status;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis2;extendednumericanalysis=true;extendedunicodeanalysis=false;');

This example requests extended Unicode analysis without extended numeric analysis.

call twm.td_analyze('TextFieldAnalyzer','database=twm;tablename=TWMX_ADSParameters;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis3;extendednumericanalysis=false;extendedunicodeanalysis=true;');