Correlation Example: Include PARTITION BY

Correlation Example: Include PARTITION BY - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.10

1.1

Published

October 2019

Language

English (United States)

Last Update

2019-12-31

dita:mapPath

ima1540829771750.ditamap

dita:ditavalPath

jsj1481748799576.ditaval

dita:id

B700-4003

lifecycle

Product Category

Teradata Vantage™

This example uses the PARTITION BY clause to group the data by state. The function calculates the correlations between columns separately for California (CA) and Texas (TX) data.

Input

The input table, corr_input, is sample macroeconomic data for the states of California and Texas over a period of 16 years (1947-1962). The GDP (gross domestic product) numbers are in millions of dollars ($M). GDPdeflator is GDP data normalized to the year 1954 that is, GDP is 100 for 1954). The other columns represent the number of people (in thousands) who were employed, unemployed, or in the armed forces.

corr_input
year	state	gdpdeflator	gdp	unemployed	armedforces	employed
1947	CA	64.52	234.289	235.6	159	60.323
1948	CA	71.45	259.426	232.5	145.6	61.122
1949	CA	71.07	258.054	368.2	161.6	60.171
1950	CA	78.38	284.599	335.1	165	61.187
1951	CA	90.6	328.975	209.9	309.9	63.221
1952	CA	95.56	346.999	193.2	359.4	63.639
1953	CA	100.63	365.385	187	354.7	64.989
1954	CA	100	363.112	357.8	335	63.761
1955	TX	109.46	397.469	290.4	304.8	66.019
1956	TX	115.44	419.18	282.2	285.7	67.857
1957	TX	121.94	442.769	293.6	279.8	68.169
1958	TX	122.43	444.546	468.1	263.7	66.513
1959	TX	132.94	482.704	381.3	255.2	68.655
1960	TX	138.41	502.601	393.1	251.4	69.564
1961	TX	142.7	518.173	480.6	257.2	69.331
1962	TX	152.82	554.894	400.7	282.7	70.551

SQL Call

The function calculates the correlation between each pair of columns in the TargetColumns syntax element. This example compares GDP to GDPdeflator, the employed population to GDP, the number of people unemployed, and the number of people in the armed forces.

SELECT * FROM CorrelationReduce (
  ON CorrelationMap (
    ON corr_input PARTITION BY state
    USING
    TargetColumns ('[2:3]','employed:gdp','employed:unemployed','employed:armedforces')
    KeyName ('keys')
    GroupByColumns('state')
  ) PARTITION BY keys,state
) AS dt;

Output

Because GDP and GDPdeflator represent the same data but with different scaling, their correlation is 1. The correlation coefficients for all column pairs are shown below.

state corr_item1  corr_item2  value_col
----- ----------- ----------- --------------------
ca    gdpdeflator gdp         1.0
ca    armedforces employed    0.9528257846832275
ca    unemployed  employed    -0.43761759996414185
tx    gdpdeflator gdp         0.9999999403953552
tx    armedforces employed    -0.4519846439361572
tx    unemployed  employed    0.3207692801952362
ca    gdp         employed    0.967695415019989
tx    gdp         employed    0.9127569198608398
ca    gdp         gdp         1.0
tx    gdp         gdp         1.0
ca    gdpdeflator gdpdeflator 1.0
tx    gdpdeflator gdpdeflator 1.0

Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.