Every complete example in this document is available in a zip file that you can download. The zip file includes a SQL script file that creates the input tables for the examples. If you are reading this document on https://docs.teradata.com/, you can download the zip file from the attachment in the left sidebar.
TD_CategoricalSummary InputTable: cat_titanic_train
passenger survived pclass name gender age sibsp parch ticket fare cabin embarked --------- -------- ------ ------------------------------------ ------ --- ----- ----- -------- --------- ----------- -------- 97 0 1 Goldschmidt; Mr. George B male 71 0 0 PC 17754 34.6542 A5 C 488 0 1 Kent; Mr. Edward Austin male 58 0 0 11771 29.7 B37 C 505 1 1 Maioni; Miss. Roberta female 16 0 0 110152 86.5 B79 S 631 1 1 Barkworth; Mr. Algernon Henry Wilson male 80 0 0 27042 30 A23 S 873 0 1 Carlsson; Mr. Frans Olof male 33 0 0 695 5 B51 B53 B55 S
Example: TD_CategoricalSummary Call for Single Column
SELECT * FROM TD_CategoricalSummary ( ON cat_titanic_train AS InputTable USING TargetColumns ('gender') ) AS dt;
TD_CategoricalSummary Output
ColumnName DistinctValue DistinctValueCount ---------- ------------- ------------------ gender female 1 gender male 4
Example: Calculating Statistics Using Multiple Columns
You can input multiple columns in the TargetColumns parameter to have these statistics calculated for all columns at the same time. embarked has been added.
SELECT * FROM TD_CategoricalSummary ( ON cat_titanic_train AS InputTable USING TargetColumns ('embarked', 'gender') ) AS dt;
TD_Categorical Columns Output
ColumnName DistinctValue DistinctValueCount ---------- ------------- ------------------ embarked S 3 gender female 1 embarked C 2 gender male 4
Example: Calculating Statistics Using Indexing
Another way to run the query is to use indexing. Instead of writing column names, you can also use indexes. Indexes start from 0 (first column is indexed by 0) and then 1, and so on.
Input data for this (passengers_2):
name gender ticket cabin embarked ------ ------ -------- ------ -------- Harris male PC 17754 A3 C Jim male PC 1754 A5 B Taha male PC 1984 A8 C Ahmed male PC 17754 A5 C John male PC 1772 A8 A
Indexing Input Query
The reason to use this data is because TD_CategoricalSummary only works on categorical columns and this dataset only has categorical columns. The previous dataset had some integer and decimal datatype columns.
Running the following query gives the stats for the zeroth, first, and second columns:
SELECT * FROM TD_CategoricalSummary ( ON passengers_2 AS InputTable USING TargetColumns ('[0:2]') ) AS dt;
Indexing Output
ColumnName DistinctValue DistinctValueCount ---------- ------------- ------------------ name Ahmed 1 name Jim 1 name Harris 1 name John 1 ticket PC 1772 1 ticket PC 17754 2 gender male 5 name Taha 1 ticket PC 1984 1 ticket PC 1754 1
Statistics for All Columns Query
SELECT * FROM TD_CategoricalSummary ( ON passengers_2 AS InputTable USING TargetColumns ('[:]') ) AS dt;
All Columns Output
ColumnName DistinctValue DistinctValueCount ---------- ------------- ------------------ cabin A3 1 name Jim 1 cabin A8 2 cabin A5 2 name Ahmed 1 ticket PC 17754 2 embarked B 1 embarked A 1 ticket PC 1772 1 name Harris 1 embarked C 3 ticket PC 1984 1 gender male 5 name John 1 name Taha 1 ticket PC 1754 1
Using indexes in the TargetColumns parameter requires entering square brackets in quotes, and the entered column indexes need to be of the CHAR or VARCHAR datatype. If they are not, the query gives an error. You select all columns by entering the following [:] in quotes on the TargetColumns parameter.