Correlation Example: Include PARTITION BY - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
dita:mapPath
ima1540829771750.ditamap
dita:ditavalPath
jsj1481748799576.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

This example uses the PARTITION BY clause to group the data by state. The function calculates the correlations between columns separately for California (CA) and Texas (TX) data.

Input

The input table, corr_input, is sample macroeconomic data for the states of California and Texas over a period of 16 years (1947-1962). The GDP (gross domestic product) numbers are in millions of dollars ($M). GDPdeflator is GDP data normalized to the year 1954 that is, GDP is 100 for 1954). The other columns represent the number of people (in thousands) who were employed, unemployed, or in the armed forces.

corr_input
year state gdpdeflator gdp unemployed armedforces employed
1947 CA 64.52 234.289 235.6 159 60.323
1948 CA 71.45 259.426 232.5 145.6 61.122
1949 CA 71.07 258.054 368.2 161.6 60.171
1950 CA 78.38 284.599 335.1 165 61.187
1951 CA 90.6 328.975 209.9 309.9 63.221
1952 CA 95.56 346.999 193.2 359.4 63.639
1953 CA 100.63 365.385 187 354.7 64.989
1954 CA 100 363.112 357.8 335 63.761
1955 TX 109.46 397.469 290.4 304.8 66.019
1956 TX 115.44 419.18 282.2 285.7 67.857
1957 TX 121.94 442.769 293.6 279.8 68.169
1958 TX 122.43 444.546 468.1 263.7 66.513
1959 TX 132.94 482.704 381.3 255.2 68.655
1960 TX 138.41 502.601 393.1 251.4 69.564
1961 TX 142.7 518.173 480.6 257.2 69.331
1962 TX 152.82 554.894 400.7 282.7 70.551

SQL Call

The function calculates the correlation between each pair of columns in the TargetColumns syntax element. This example compares GDP to GDPdeflator, the employed population to GDP, the number of people unemployed, and the number of people in the armed forces.

SELECT * FROM CorrelationReduce (
  ON CorrelationMap (
    ON corr_input PARTITION BY state
    USING
    TargetColumns ('[2:3]','employed:gdp','employed:unemployed','employed:armedforces')
    KeyName ('keys')
    GroupByColumns('state')
  ) PARTITION BY keys,state
) AS dt;

Output

Because GDP and GDPdeflator represent the same data but with different scaling, their correlation is 1. The correlation coefficients for all column pairs are shown below.

state corr_item1  corr_item2  value_col
----- ----------- ----------- --------------------
ca    gdpdeflator gdp         1.0
ca    armedforces employed    0.9528257846832275
ca    unemployed  employed    -0.43761759996414185
tx    gdpdeflator gdp         0.9999999403953552
tx    armedforces employed    -0.4519846439361572
tx    unemployed  employed    0.3207692801952362
ca    gdp         employed    0.967695415019989
tx    gdp         employed    0.9127569198608398
ca    gdp         gdp         1.0
tx    gdp         gdp         1.0
ca    gdpdeflator gdpdeflator 1.0
tx    gdpdeflator gdpdeflator 1.0

Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.