This example finds the best matching distribution for a data set that consists of integers that are from different sources and are output by different distributions.
Input
The input table, age_distribution, contains the ages of children visiting three amusement parks during a one-week period in spring and another one-week period in summer.
season | park_name | age |
---|---|---|
Spring | Funland | 12 |
Spring | Funland | 1 |
Spring | Funland | 10 |
Spring | Funland | 2 |
Spring | Funland | 10 |
Spring | Funland | 3 |
Spring | Funland | 3 |
Spring | Funland | 7 |
Spring | Funland | 11 |
Spring | Funland | 8 |
... | ... | ... |
Spring | Wonderland | 10 |
Spring | Wonderland | 12 |
Spring | Wonderland | 9 |
Spring | Wonderland | 12 |
Spring | Wonderland | 10 |
Spring | Wonderland | 8 |
Spring | Wonderland | 8 |
Spring | Wonderland | 8 |
Spring | Wonderland | 9 |
Spring | Wonderland | 8 |
... | ... | ... |
Spring | KidsWorld | 7 |
Spring | KidsWorld | 6 |
Spring | KidsWorld | 5 |
Spring | KidsWorld | 6 |
Spring | KidsWorld | 4 |
Spring | KidsWorld | 1 |
Spring | KidsWorld | 8 |
Spring | KidsWorld | 6 |
Spring | KidsWorld | 3 |
Spring | KidsWorld | 4 |
... | ... | ... |
Summer | Funland | 10 |
Summer | Funland | 6 |
Summer | Funland | 10 |
Summer | Funland | 7 |
Summer | Funland | 4 |
Summer | Funland | 8 |
Summer | Funland | 6 |
Summer | Funland | 7 |
Summer | Funland | 4 |
Summer | Funland | 10 |
... | ... | ... |
Summer | Wonderland | 7 |
Summer | Wonderland | 9 |
Summer | Wonderland | 9 |
Summer | Wonderland | 2 |
Summer | Wonderland | 4 |
Summer | Wonderland | 8 |
Summer | Wonderland | 5 |
Summer | Wonderland | 4 |
Summer | Wonderland | 7 |
Summer | Wonderland | 11 |
... | ... | ... |
Summer | KidsWorld | 5 |
Summer | KidsWorld | 5 |
Summer | KidsWorld | 3 |
Summer | KidsWorld | 8 |
Summer | KidsWorld | 4 |
Summer | KidsWorld | 7 |
Summer | KidsWorld | 6 |
Summer | KidsWorld | 3 |
Summer | KidsWorld | 2 |
Summer | KidsWorld | 0 |
... | ... | ... |
SQL Call
SELECT * FROM DistributionMatchReduce ( ON DistributionMatchMultiInput ( ON ( SELECT COUNT(1) AS counts, SUM(counts) OVER (PARTITION BY park_name, season ORDER BY age ROWS UNBOUNDED PRECEDING) AS "rank", park_name, season, age FROM age_distribution WHERE age IS NOT NULL GROUP BY park_name, season, age ) AS InputTable PARTITION BY ANY ON ( SELECT park_name, season, COUNT(*) AS group_size, AVG(age) AS mean, STDDEV_SAMP(age) AS sd, MAX(age) AS "maximum", MIN(age) AS "minimum" FROM age_distribution WHERE age IS NOT NULL GROUP BY park_name, season ) AS GroupStatistics DIMENSION USING TargetColumn('age') TESTS('KS', 'AD', 'CHISQ') GroupByColumns('park_name', 'season') MINGROUPSIZE(50) NumCell(10) ) PARTITION BY park_name, season )AS dt;
Output
The function has attempted to identify the best matching distribution for each partition of the data, based on each test specified in the SQL call. For each partition, the output shows the distribution and parameters identified by each test with the associated p-value.
park_name season group_size best_match_ks_top1 p_value_ks_top1 best_match_ad_top1 p_value_ad_top1 best_match_chisq_top1 p_value_chisq_top1 ---------- ------ ---------- ------------------------------ -------------------- -------------------------------------- -------------------- -------------------------------- --------------------- wonderland spring 400 binomial:16,0.5091542759742608 0.06241687387228012 binomial:16,0.5091542759742608 1.500000053056283E-6 binomial:16,0.5091542759742608 0.0 funland spring 400 uniformdiscrete:1,12 0.9992926716804504 negativebinomial:7,0.46705571565376613 1.500000053056283E-6 uniformdiscrete:1,12 8.948397578478762E-13 wonderland summer 400 poisson:7.005000114440918 0.9931338429450989 poisson:7.005000114440918 6.47109345663921E-6 binomial:110,0.06332936686652757 0.0 funland summer 400 poisson:7.065000057220459 6.726568099111319E-4 binomial:13,0.507505492077339 1.500000053056283E-6 binomial:13,0.507505492077339 0.0 kidsworld spring 400 uniformdiscrete:1,8 0.9639452695846558 binomial:100,0.5 1.500000053056283E-6 binomial:100,0.5 0.0 kidsworld summer 400 poisson:4.457499980926514 0.9960735440254211 binomial:100,0.5 1.500000053056283E-6 binomial:100,0.5 0.0
Download a zip file of all examples and a SQL script file that creates their input tables.