Text Field Analysis - Teradata Warehouse Miner

Purpose

When working with character data it is useful to determine the data type and what data can be stored in the database. The Text Field Analyzer analyzes character data and distinguishes if the field is a numeric type, date, time, timestamp, or character data. Use the Text Field Analyzer to any type of character data. Non-character data types go unprocessed and are passed along to the output as defined in the input table.

Using a table name and the name of one or more columns, the Text Field Analyzer provides a series of tests to distinguish what the correct underlying type of each selected column should be. The first test performed on the column is the MIN and the MAX test. The MIN and MAX values of a column are retrieved from the database and tested to determine what type the values are. The next test is a sample test which retrieves a small sample of data for each column and again assesses what type they should be. The next test is for fields that were determined to be numeric and it tries to classify them in a more specific category if possible. For instance, a column that is considered a FLOAT type after the first two tests might really be a DECIMAL type with 2 decimal places (the extended numeric test is optional but by default is performed). In the next test, a date type is validated to make sure all values in that column are truly dates. Finally, if requested, Unicode columns are tested to see if they contain only Latin characters. The extended Unicode analysis is optional and by default is not performed.

A Text Field Analyzer can be performed on columns of any character data type. Columns of non-character data type are passed along to the output as defined in the input table.

Syntax

call twm. td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis;Optional Parameters;');

Required Parameters

columns

The columns to analyze. For example, columns=c1,c2,c3. When columns=all is entered, all columns in the input table are analyzed.

database

The database containing the table to analyze.

outputdatabase

The Text Field Analyzer generates SQL to create a table with columns of proposed data types. The outputdatabase parameter is used in this generated SQL to fully qualify the name of the output table with the output database. The SQL is not run by the analysis.

outputtablename

The Text Field Analyzer generates SQL to create a table with columns of proposed data types. The outputtablename parameter is the name of the output table. The SQL is not run by the analysis.

tablename

The table containing the columns to analyze.

TextFieldAnalyzer

The TextFieldAnalzer parameter:

Is required
Must be the first parameter
Is always enclosed in single quotes

Optional Parameters

extendednumericanalysis: Determines whether to process specific numeric types. By default extendednumericanalysis=true to process extends numeric analysis.
extendedunicodeanalysis: Determines if a column declared to contain Unicode characters actually contains only Latin characters is enabled by setting the extendedunicodeanalysis parameter to true. By default, extended Unicode analysis is not performed. This is equivalent to extendedunicodeanalysis=false.
overwrite: When overwrite is set to true (default), the output tables are dropped before creating new ones.

Examples

These examples demonstrate the invocation of the Text Field Analyzer. To run the provided examples, the td_analyze function must be installed in a database called twm and the TWM tutorial data must be installed in the twm_source database.

This example includes only required parameters and relies on default values to request extended numeric analysis and not to request extended Unicode analysis.

call twm.td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis;');

This example is functionally equivalent to the first example but explicitly sets the options for extended numeric analysis and extended Unicode analysis to their default values (true and false).

call twm.td_analyze('TextFieldAnalyzer','database=twm_source;tablename=twm_customer;columns=cust_id,gender,marital_status;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis2;extendednumericanalysis=true;extendedunicodeanalysis=false;');

This example requests extended Unicode analysis without extended numeric analysis.

call twm.td_analyze('TextFieldAnalyzer','database=twm;tablename=TWMX_ADSParameters;columns=all;outputdatabase=twm_results;outputtablename=_twm_text_field_analysis3;extendednumericanalysis=false;extendedunicodeanalysis=true;');