Purpose
The Data Explorer performs basic statistical analysis on a set of selected tables, or on selected columns from selected tables, in one or more databases. It stores results from four fundamental types of analysis based on simplified versions of the Descriptive Statistics analyses:
- Values
- Statistics
- Frequency
- Histogram
An answer table is produced for each requested type of analysis, the output including requested database, table and column names in order to allow results from multiple tables to be included in each answer table. Each analysis can be selected individually, with the following exceptions:
- If Frequency is selected, Values must be selected
- If Histogram is selected, Values and Statistics must be selected, including the Count, Minimum, Maximum, Mean and Standard Deviation.
The Data Explorer includes intelligence about which functions should be performed on which columns, with decisions based partly on column type and partly on obtained results. It also includes performance enhancements in the generated SQL to help minimize passes through the data. A separate SQL Where Clause can optionally be specified for each of the input tables selected for analysis.
A Values analysis can be performed on columns of any data type, though the measures displayed vary according to the column type.
A Statistics analysis can be performed on columns of numeric or date data type. For columns of type DATE, statistics other than count, minimum, maximum and mean are calculated by first converting to the number of days since 1900.
Generally, a Frequency analysis can be performed on columns of any data type except types that hold byte data (for example, BYTE and VARBYTE type data).
A Histogram analysis can be performed on columns of numeric or date data type.
Syntax
call twm. td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;Optional Parameters;');Required Parameters
- database
- The database or databases containing the table or tables to be analyzed.
- DataExplorer
- The DataExplorer parameter:
- Is required
- Must be the first parameter
- Is always enclosed in single quotes
- outputdatabase
- The results of the analysis will be stored in one or more tables in this database.
- tablename
- The table or tables containing the columns to be analyzed. For example, tablename=t1,t2,t3 indicates that three tables (t1, t2 and t3) are to be analyzed. Note that the number of databases in the database parameter and the number of tables in the tablename parameter must be the same, as described for the database parameter.
Optional Parameters
- bins
- The number of bins to create if the Histogram analysis is performed. The default is 10.
- binstyle
- The style of binning to perform if the Histogram analysis is performed. The possibilities are binstyle=bins (default) and binstyle=quantiles.
- columns
- The columns to be analyzed. For example, columns=c1,c2,c3. If columns=all is entered, all columns in the input table are analyzed.
- maxnumcombvalues
- The maximum number of combined values for frequency or histogram analysis. The default is 10000.
- maxuniquecharvalues
- The maximum number of unique character values for unrestricted frequency analysis. The default value is 100.
- maxuniquenumvalues
- The maximum number of unique numeric or date values for frequency analysis. The default is 20.
- minrowsforcomb
- Minimum number of rows before frequency or histogram combining attempted. The default is 25000.
- restrictedfreqproc
- Performs restricted frequency processing including prominent values. The default is true.
- restrictedthreshold
- The minimum fraction of rows frequency value must occur in for restricted frequency processing to be performed. The default is 1.
- statisticalmethod
- Determines if Population or Sample statistics should be calculated. For example, statisicalmethod=population or statisticalmethod=sample can be specified. Population statistics is the default if neither is specified.
- statsoptions
- Requests the basic statistics to be calculated if the Statistics analysis is performed. For convenience, shortened aliases can be used instead of the lengthier names (for example, statsoptions=all or statsoptions=cnt,min,max,mean,std). Available statistics include:
- all
- none
- count (cnt)
- minimum (min)
- maximum (max)
- mean
- standarddeviation (std)
- skewness (skew)
- kurtosis (kurt)
- standarderror (ste)
- coefficientofvariance (cv)
- variance (var)
- sum
- uncorrectedsumofsquares (uss)
- correctedsumofsquares (css)
- uniques
- When set to true, unique values are counted for each selected column. By default, unique values are set to true and not counted.
- where
-
An optional WHERE clause that can be used to filter the data to be processed. For example:
where=cust_id > 0
Examples
The examples in this section demonstrate the performance of the valid combinations of analyses with a single input table. These examples assume that the td_analyze function has been installed in a database named twm.
This example performs all analyses (Values, Statistics, Frequency and Histogram) and creates the output tables for each of these analyses with default names.
call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;');
This example performs only the Values analysis, naming the Values output table v1.
call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;');
This example performs only the Statistics analysis, naming the Statistics output table s1.
call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;statisticsoutputtablename=s1;');
This example performs a Values and Statistics analysis, naming the output tables v1 and s1 respectively.
call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;statisticsoutputtablename=s1;');
This example performs both a Values and Frequency analysis, naming the output tables v1 and f1 respectively.
call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;frequencyoutputtablename=f1;');
This example performs a Values, Statistics and Histogram analysis, naming the output tables v1, s1 and h1 respectively.
call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;statisticsoutputtablename=s1;histogramoutputtablename=h1;');
This example performs one of each of the four analyses, Values, Statistics, Frequency and Histogram, and names the output tables v1, s1, f1 and h1 respectively.
call twm.td_analyze('DataExplorer','database=twm_source;tablename=twm_customer;outputdatabase=twm_results;valuesoutputtablename=v1;statisticsoutputtablename=s1;frequencyoutputtablename=f1;histogramoutputtablename=h1;');