Statistical measures for numeric columns help you understand each column's characteristics, assess data quality, and find outlying values and other possible anomalies. The Statistics analysis provides both common and less common statistical measures for numeric data columns. Extended options add further analyses and measures such as Values, Modes, Quantiles, and Ranks.
A Statistics analysis can be performed on columns of numeric or date data type. For columns of type DATE, statistics other than count, minimum, maximum, and mean are calculated by first converting to the number of days since 1900.
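As a rough illustration of the date conversion described above, a date can be expressed as a day count by subtracting a reference date. The sketch below assumes 1900-01-01 as the reference point, which this section does not state exactly, and it is not a description of how td_analyze performs the conversion internally.

```sql
-- Assumption: the day count starts at 1900-01-01; the exact origin used by
-- td_analyze is not stated in this section.
-- In Teradata, subtracting one DATE from another yields the number of days between them.
SELECT DATE '2000-01-01' - DATE '1900-01-01' AS days_since_1900;
```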
Syntax
call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;Optional Parameters;');
- columns: The columns to analyze.
- database: The database containing the table to analyze.
- The Statistics parameter:
  - Is required
  - Must be the first parameter
  - Is always enclosed in single quotes
- tablename: The table containing the columns to analyze.
- Use the extendedoptions parameter to request any of the following extended options:
  - none (default if extendedoptions is not specified)
If columns are specified with the groupby parameter, a separate analysis is performed for each value or combination of values in the specified columns. For example: groupby=gender,marital_status.
- outputdatabase: Specifies the name of the database to contain the analysis results table.
- outputtablename: Specifies the name of the table to store the analysis results. If not supplied, the results are returned as a result set.
When overwrite is set to true (default), the output tables are dropped before new ones are created.
- Whether the statistics are calculated as population or sample statistics. Specify statisticalmethod=population or statisticalmethod=sample. Population statistics are the default if neither is specified.
- The basic statistics to be calculated if the Statistics analysis is performed. Shortened aliases can be used instead of the lengthier names (for example, statsoptions=all or statsoptions=cnt,min,max,mean,std). Available statistics include:
- count (cnt)
- minimum (min)
- maximum (max)
- mean
- standarddeviation (std)
- skewness (skew)
- kurtosis (kurt)
- standarderror (ste)
- coefficientofvariance (cv)
- variance (var)
- uncorrectedsumofsquares (uss)
- correctedsumofsquares (css)
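For orientation, several of the measures listed above are related through the usual textbook definitions, sketched below for a column with values x_1, ..., x_n and mean x̄. These formulas are an assumption based on standard statistical usage, not a statement of the exact formulas td_analyze applies; in particular, the coefficient of variance is sometimes reported as a percentage, and the skewness and kurtosis formulas are not covered here.

```latex
\begin{aligned}
\text{uss} &= \sum_{i=1}^{n} x_i^2 \\
\text{css} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 = \text{uss} - n\,\bar{x}^2 \\
\sigma^2 &= \frac{\text{css}}{n} \quad \text{(population variance)} \\
s^2 &= \frac{\text{css}}{n-1} \quad \text{(sample variance)} \\
\text{std} &= \sqrt{\text{var}}, \qquad
\text{ste} = \frac{\text{std}}{\sqrt{n}}, \qquad
\text{cv} = \frac{\text{std}}{\bar{x}}
\end{aligned}
```

Under these definitions, the population and sample variants differ only in the divisor (n versus n - 1), so the two converge as the row count grows.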
The optional WHERE clause to filter the data to process. For example:
where=cust_id > 0
To execute the provided examples, the td_analyze function must be installed in a database called twm and the Teradata Warehouse Miner tutorial data must be installed in the twm_source database.
These examples demonstrate the invocation of the Statistics analysis with minimal parameters. The statistics calculated are count, min, max, mean, and standard deviation.
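A minimal invocation, assembled from the syntax shown earlier, might look like the following sketch. The second call assumes the twm_results database and the _twm_statistics table name used elsewhere in this section.

```sql
-- Minimal call: results are returned as a result set.
call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;');

-- Same analysis, with results written to an output table instead.
-- Assumption: the twm_results database exists and is writable.
call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;outputdatabase=twm_results;outputtablename=_twm_statistics;');
```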
The following example writes its results to an output table and uses a group-by column and a WHERE clause.
call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;outputdatabase=twm_results;outputtablename=_twm_statistics;groupby=gender;where=income > 0;');
The following example demonstrates the selection of all statistical measures and extended options.
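A call along these lines would request every measure. statsoptions=all appears in the parameter description above; extendedoptions=all is an assumption made by analogy and should be checked against the installed version.

```sql
-- statsoptions=all is documented above.
-- Assumption: extendedoptions accepts "all" in the same way.
call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;statsoptions=all;extendedoptions=all;');
```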
The following example demonstrates the selection of individual statistical measures and extended options, returning sample statistics.
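A sketch of such a call follows. The statsoptions aliases and statisticalmethod=sample come from the parameter descriptions above; the extendedoptions values modes and quantiles are assumed spellings derived from the option names mentioned in the introduction.

```sql
-- Aliases cnt, min, max, mean, std are listed under statsoptions.
-- Assumption: "modes" and "quantiles" are the accepted extendedoptions values.
call twm.td_analyze('Statistics','database=twm_source;tablename=twm_customer;columns=income;statsoptions=cnt,min,max,mean,std;extendedoptions=modes,quantiles;statisticalmethod=sample;');
```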