This example finds the best matching distribution for a data set that consists of integers that are from different sources and are output by different distributions.
Input
The input table, age_distribution, contains the ages of children visiting three amusement parks during a one-week period in spring and another one-week period in summer.
season | park_name | age |
---|---|---|
Spring | Funland | 12 |
Spring | Funland | 1 |
Spring | Funland | 10 |
Spring | Funland | 2 |
Spring | Funland | 10 |
Spring | Funland | 3 |
Spring | Funland | 3 |
Spring | Funland | 7 |
Spring | Funland | 11 |
Spring | Funland | 8 |
... | ... | ... |
Spring | Wonderland | 10 |
Spring | Wonderland | 12 |
Spring | Wonderland | 9 |
Spring | Wonderland | 12 |
Spring | Wonderland | 10 |
Spring | Wonderland | 8 |
Spring | Wonderland | 8 |
Spring | Wonderland | 8 |
Spring | Wonderland | 9 |
Spring | Wonderland | 8 |
... | ... | ... |
Spring | KidsWorld | 7 |
Spring | KidsWorld | 6 |
Spring | KidsWorld | 5 |
Spring | KidsWorld | 6 |
Spring | KidsWorld | 4 |
Spring | KidsWorld | 1 |
Spring | KidsWorld | 8 |
Spring | KidsWorld | 6 |
Spring | KidsWorld | 3 |
Spring | KidsWorld | 4 |
... | ... | ... |
Summer | Funland | 10 |
Summer | Funland | 6 |
Summer | Funland | 10 |
Summer | Funland | 7 |
Summer | Funland | 4 |
Summer | Funland | 8 |
Summer | Funland | 6 |
Summer | Funland | 7 |
Summer | Funland | 4 |
Summer | Funland | 10 |
... | ... | ... |
Summer | Wonderland | 7 |
Summer | Wonderland | 9 |
Summer | Wonderland | 9 |
Summer | Wonderland | 2 |
Summer | Wonderland | 4 |
Summer | Wonderland | 8 |
Summer | Wonderland | 5 |
Summer | Wonderland | 4 |
Summer | Wonderland | 7 |
Summer | Wonderland | 11 |
... | ... | ... |
Summer | KidsWorld | 5 |
Summer | KidsWorld | 5 |
Summer | KidsWorld | 3 |
Summer | KidsWorld | 8 |
Summer | KidsWorld | 4 |
Summer | KidsWorld | 7 |
Summer | KidsWorld | 6 |
Summer | KidsWorld | 3 |
Summer | KidsWorld | 2 |
Summer | KidsWorld | 0 |
... | ... | ... |
SQL Call
SELECT * FROM DistributionMatchReduce ( ON DistributionMatchMultiInput ( ON ( SELECT COUNT(1) AS counts, SUM(counts) OVER (PARTITION BY park_name, season ORDER BY age ROWS UNBOUNDED PRECEDING) AS "rank", park_name, season, age FROM age_distribution WHERE age IS NOT NULL GROUP BY park_name, season, age ) AS "input" PARTITION BY ANY ON ( SELECT park_name, season, COUNT(*) AS group_size, AVG(age) AS mean, STDDEV_SAMP(age) AS sd, MAX(age) AS "maximum", MIN(age) AS "minimum" FROM age_distribution WHERE age IS NOT NULL GROUP BY park_name, season ) AS groupstats DIMENSION USING ValueColumn ('age') Tests ('KS', 'AD', 'CHISQ') GroupByColumns ('park_name', 'season') MinGroupSize (50) NumCell (10) ) PARTITION BY park_name, season ) AS dt;
Output
The function has attempted to identify the best matching distribution for each partition of the data, based on each test specified in the SQL call. For each partition, the output shows the distribution and parameters identified by each test with the associated p-value.
park_name | season | group_size | best_match_ks_top1 | p_value_ks_top1 | best_match_ad_top1 | p_value_ad_top1 | best_match_chisq_top1 | p_value_chisq_top1 |
---|---|---|---|---|---|---|---|---|
KidsWorld | Spring | 400 | UNIFORMDISCRETE:1,8 | 0.963945 | BINOMIAL:100,0.5 | 1.5e-06 | BINOMIAL:100,0.5 | 0 |
KidsWorld | Summer | 400 | POISSON:4.457499980926514 | 0.996074 | BINOMIAL:100,0.5 | 1.5e-06 | BINOMIAL:100,0.5 | 0 |
Funland | Spring | 400 | UNIFORMDISCRETE:1,12 | 0.999293 | NEGATIVEBINOMIAL:7,0.46705571565376613 | 1.5e-06 | UNIFORMDISCRETE:1,12 | 8.9484e-13 |
Funland | Summer | 400 | POISSON:7.065000057220459 | 0.000672657 | BINOMIAL:13,0.507505492077339 | 1.5e-06 | BINOMIAL:13,0.507505492077339 | 0 |
Wonderland | Spring | 400 | BINOMIAL:16,0.5091542759742608 | 0.0624169 | BINOMIAL:16,0.5091542759742608 | 1.5e-06 | BINOMIAL:16,0.5091542759742608 | 0 |
Wonderland | Summer | 400 | POISSON:7.005000114440918 | 0.993134 | POISSON:7.005000114440918 | 6.47109e-06 | BINOMIAL:110,0.06332936686652757 | 0 |