1.0 - 8.00 - Best-Match Mode Example 2: Input Values of Type INTEGER - Teradata Vantage

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
1.0
8.00
Release Date
May 2019
Content Type
Programming Reference
Publication ID
B700-4003-098K
Language
English (United States)

This example finds the best matching distribution for a data set that consists of integers that are from different sources and are output by different distributions.

Input

The input table, age_distribution, contains the ages of children visiting three amusement parks during a one-week period in spring and another one-week period in summer.

age_distribution
season park_name age
Spring Funland 12
Spring Funland 1
Spring Funland 10
Spring Funland 2
Spring Funland 10
Spring Funland 3
Spring Funland 3
Spring Funland 7
Spring Funland 11
Spring Funland 8
... ... ...
Spring Wonderland 10
Spring Wonderland 12
Spring Wonderland 9
Spring Wonderland 12
Spring Wonderland 10
Spring Wonderland 8
Spring Wonderland 8
Spring Wonderland 8
Spring Wonderland 9
Spring Wonderland 8
... ... ...
Spring KidsWorld 7
Spring KidsWorld 6
Spring KidsWorld 5
Spring KidsWorld 6
Spring KidsWorld 4
Spring KidsWorld 1
Spring KidsWorld 8
Spring KidsWorld 6
Spring KidsWorld 3
Spring KidsWorld 4
... ... ...
Summer Funland 10
Summer Funland 6
Summer Funland 10
Summer Funland 7
Summer Funland 4
Summer Funland 8
Summer Funland 6
Summer Funland 7
Summer Funland 4
Summer Funland 10
... ... ...
Summer Wonderland 7
Summer Wonderland 9
Summer Wonderland 9
Summer Wonderland 2
Summer Wonderland 4
Summer Wonderland 8
Summer Wonderland 5
Summer Wonderland 4
Summer Wonderland 7
Summer Wonderland 11
... ... ...
Summer KidsWorld 5
Summer KidsWorld 5
Summer KidsWorld 3
Summer KidsWorld 8
Summer KidsWorld 4
Summer KidsWorld 7
Summer KidsWorld 6
Summer KidsWorld 3
Summer KidsWorld 2
Summer KidsWorld 0
... ... ...

SQL Call

SELECT * FROM DistributionMatchReduce (
  ON DistributionMatchMultiInput (
    ON (
      SELECT COUNT(1) AS counts,
        SUM(counts) OVER (PARTITION BY park_name, season
          ORDER BY age ROWS UNBOUNDED PRECEDING) AS "rank", park_name, season, age
      FROM age_distribution
      WHERE age IS NOT NULL
      GROUP BY park_name, season, age
    ) AS "input" PARTITION BY ANY
    ON (
      SELECT park_name, season, COUNT(*) AS group_size, AVG(age) AS mean,
        STDDEV_SAMP(age) AS sd, MAX(age) AS "maximum", MIN(age) AS "minimum"
      FROM age_distribution
      WHERE age IS NOT NULL
      GROUP BY park_name, season
    ) AS groupstats DIMENSION
    USING
    ValueColumn ('age')
    Tests ('KS', 'AD', 'CHISQ')
    GroupByColumns ('park_name', 'season')
    MinGroupSize (50)
    NumCell (10)
  ) PARTITION BY park_name, season
) AS dt;

Output

The function has attempted to identify the best matching distribution for each partition of the data, based on each test specified in the SQL call. For each partition, the output shows the distribution and parameters identified by each test with the associated p-value.

park_name season group_size best_match_ks_top1 p_value_ks_top1 best_match_ad_top1 p_value_ad_top1 best_match_chisq_top1 p_value_chisq_top1
KidsWorld Spring 400 UNIFORMDISCRETE:1,8 0.963945 BINOMIAL:100,0.5 1.5e-06 BINOMIAL:100,0.5 0
KidsWorld Summer 400 POISSON:4.457499980926514 0.996074 BINOMIAL:100,0.5 1.5e-06 BINOMIAL:100,0.5 0
Funland Spring 400 UNIFORMDISCRETE:1,12 0.999293 NEGATIVEBINOMIAL:7,0.46705571565376613 1.5e-06 UNIFORMDISCRETE:1,12 8.9484e-13
Funland Summer 400 POISSON:7.065000057220459 0.000672657 BINOMIAL:13,0.507505492077339 1.5e-06 BINOMIAL:13,0.507505492077339 0
Wonderland Spring 400 BINOMIAL:16,0.5091542759742608 0.0624169 BINOMIAL:16,0.5091542759742608 1.5e-06 BINOMIAL:16,0.5091542759742608 0
Wonderland Summer 400 POISSON:7.005000114440918 0.993134 POISSON:7.005000114440918 6.47109e-06 BINOMIAL:110,0.06332936686652757 0