Distribution Matching Hypothesis-Test Mode Example: Include GroupByColumns - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 9.02, 9.01, 2.0, 1.3
Published: February 2022
Language: English (United States)
Last Update: 2022-02-10
Product Category: Teradata Vantage™

This example shows the use of grouping columns, and also illustrates the syntax for testing against multiple distributions in a single SQL command.

Input

The input table, factory_7, contains hypothetical mean-time-to-failure (mttf) data for two products, A and D. This is a subset of the rows:

factory_7
product mttf
A 10039.5
A 9926.6
A 9971.34
A 9868.7
A 9940.17
A 10266.7
A 9768.64
A 10043.2
A 10133.7
A 9731.33
... ...
D 9721.21
D 10068.6
D 9952
D 9851.94
D 10378.3
D 9908.9
D 9749.43
D 10448
D 9681.25
D 10147.5
... ...

SQL Call

The function call tests the data for each product against two candidate distributions (normal and continuous uniform), using the Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) goodness-of-fit tests.

SELECT * FROM DistributionMatchReduce (
	ON DistributionMatchMultiInput (
		ON (SELECT RANK() OVER (PARTITION BY product
				ORDER BY mttf) AS "rank", product, mttf
			FROM factory_7
			WHERE mttf IS NOT NULL) AS InputTable PARTITION BY ANY
		ON (SELECT product, COUNT(*) AS group_size
			FROM factory_7
			WHERE mttf IS NOT NULL
			GROUP BY product) AS GroupStatistics DIMENSION
		USING
		TargetColumn ('mttf')
		Tests ('KS', 'AD')
		Distributions ('NORMAL:10000,150', 'UNIFORMCONTINUOUS:9500,10500')
		GroupByColumns ('product')
		MinGroupSize (50)
	) PARTITION BY product
) AS dt;
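
To make the fit tests more concrete, the following Python sketch computes a standard one-sample Kolmogorov-Smirnov statistic for the ten product A rows shown in the input excerpt, once against each distribution named in the call. It is an illustration under assumptions, not the function's implementation: it assumes that NORMAL:10000,150 denotes a mean of 10000 and a standard deviation of 150, and that UNIFORMCONTINUOUS:9500,10500 denotes the interval bounds. The helper names are hypothetical. Because only ten rows are used, the values it prints will not match the Output table, which is computed over the full groups.

```python
from statistics import NormalDist

# The ten product A rows shown in the factory_7 excerpt (not the full group).
mttf_a = [10039.5, 9926.6, 9971.34, 9868.7, 9940.17,
          10266.7, 9768.64, 10043.2, 10133.7, 9731.33]


def normal_cdf(x, mean=10000.0, sd=150.0):
    # Assumption: NORMAL:10000,150 means mean 10000, standard deviation 150.
    return NormalDist(mean, sd).cdf(x)


def uniform_cdf(x, low=9500.0, high=10500.0):
    # Assumption: UNIFORMCONTINUOUS:9500,10500 means the interval [9500, 10500].
    return min(1.0, max(0.0, (x - low) / (high - low)))


def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF and the hypothesized CDF."""
    ordered = sorted(values)  # sorting assigns each value its rank
    n = len(ordered)
    gaps = []
    for rank, x in enumerate(ordered, start=1):
        f = cdf(x)
        # Compare the hypothesized CDF with the empirical CDF just
        # after and just before the step at x.
        gaps.append(max(rank / n - f, f - (rank - 1) / n))
    return max(gaps)


d_normal = ks_statistic(mttf_a, normal_cdf)
d_uniform = ks_statistic(mttf_a, uniform_cdf)
print(f"KS statistic vs normal:  {d_normal:.4f}")
print(f"KS statistic vs uniform: {d_uniform:.4f}")
```

The sort in this sketch stands in for the "rank" column that the query computes with RANK(). How the function uses that column internally is not described in this example.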

Output

The reported p-values support these conclusions:
  • For product A:
    • Both tests fail to reject the null hypothesis that the data fit a normal distribution with the specified parameters.
    • Both tests reject the null hypothesis that the data fit the specified uniform distribution.
  • For product D:
    • Both tests fail to reject the null hypothesis that the data fit a uniform distribution with the specified parameters.
    • Both tests reject the null hypothesis that the data fit the specified normal distribution.
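
The conclusions above follow from comparing each p-value with a significance level. The example does not state which level it uses, so this Python sketch assumes 0.05. The p-values are copied from the output table below; the dictionary layout and function name are hypothetical.

```python
# p-values copied from the output table; keys are (product, distribution, test).
p_values = {
    ("A", "normal", "KS"): 0.44052770733833313,
    ("A", "normal", "AD"): 0.17789512872695923,
    ("A", "uniform", "KS"): 0.0,
    ("A", "uniform", "AD"): 2.0006669387839793e-7,
    ("D", "normal", "KS"): 0.0,
    ("D", "normal", "AD"): 0.0,
    ("D", "uniform", "KS"): 0.9938256740570068,
    ("D", "uniform", "AD"): 0.8058283925056458,
}

ALPHA = 0.05  # assumed significance level; the example does not state one


def decision(p_value, alpha=ALPHA):
    # A small p-value is evidence against the hypothesized distribution.
    return "reject" if p_value < alpha else "fail to reject"


decisions = {key: decision(p) for key, p in p_values.items()}
for (product, dist, test), verdict in decisions.items():
    print(f"product {product}, {dist}, {test}: {verdict}")
```

With these values, any common significance level (0.01, 0.05, or 0.10) yields the same eight decisions, because every p-value is either below 0.01 or above 0.10.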

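In the output table column names, when 'a' and 'b' appear between digits, interpret them as comma (,) and period (.), respectively. For example, the column prefix normal$10000a150 corresponds to the distribution specification 'NORMAL:10000,150'.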
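
The naming rule can be applied mechanically when post-processing the result. This minimal Python sketch assumes the rule is exactly as stated (an 'a' or 'b' counts only when it sits directly between two digits). The function name is hypothetical, and the last input is a made-up column name used only to show the 'b' case.

```python
import re


def decode_column_name(name):
    """Turn 'a' and 'b' between digits back into ',' and '.'."""
    name = re.sub(r"(?<=\d)a(?=\d)", ",", name)
    return re.sub(r"(?<=\d)b(?=\d)", ".", name)


print(decode_column_name("normal$10000a150_ks_p_value"))
print(decode_column_name("uniformcontinuous$9500a10500_ad_statistic"))
# Hypothetical column name, not taken from the output table:
print(decode_column_name("normal$0b5a1b5_ks_statistic"))
```

The lookbehind and lookahead leave letters that are not surrounded by digits untouched, so the 'a' in "normal" and in the "_ad_" suffix is preserved.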

 product group_size normal$10000a150_ks_statistic normal$10000a150_ks_p_value normal$10000a150_ad_statistic normal$10000a150_ad_p_value uniformcontinuous$9500a10500_ks_statistic uniformcontinuous$9500a10500_ks_p_value uniformcontinuous$9500a10500_ad_statistic uniformcontinuous$9500a10500_ad_p_value 
 ------- ---------- ----------------------------- --------------------------- ----------------------------- --------------------------- ----------------------------------------- --------------------------------------- ----------------------------------------- --------------------------------------- 
 A             2999           0.01148039847612381         0.44052770733833313            0.5279074311256409         0.17789512872695923                        0.2148340940475464                                     0.0                         365.3498229980469                   2.0006669387839793E-7
 D             3000            0.2077179253101349                         0.0             886.3829956054688                         0.0                      0.007739999797195196                      0.9938256740570068                         0.442442923784256                      0.8058283925056458
