Correlation Example 1: Include PARTITION BY Clause - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:mapPath
blj1506016597986.ditamap
dita:ditavalPath
blj1506016597986.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

This example uses the PARTITION BY clause to group the data by state. The function calculates the correlations between columns separately for California (CA) and Texas (TX) data.

Input

The input table, corr_input, is sample macroeconomic data for the states of California and Texas over a period of 16 years (1947-1962). The GDP (gross domestic product) numbers are in millions of dollars ($M). GDPdeflator is GDP data normalized to the year 1954 that is, GDP is 100 for 1954). The other columns represent the number of people (in thousands) who were employed, unemployed, or in the armed forces.

corr_input
year state gdpdeflator gdp unemployed armedforces employed
1947 CA 64.52 234.289 235.6 159 60.323
1948 CA 71.45 259.426 232.5 145.6 61.122
1949 CA 71.07 258.054 368.2 161.6 60.171
1950 CA 78.38 284.599 335.1 165 61.187
1951 CA 90.6 328.975 209.9 309.9 63.221
1952 CA 95.56 346.999 193.2 359.4 63.639
1953 CA 100.63 365.385 187 354.7 64.989
1954 CA 100 363.112 357.8 335 63.761
1955 TX 109.46 397.469 290.4 304.8 66.019
1956 TX 115.44 419.18 282.2 285.7 67.857
1957 TX 121.94 442.769 293.6 279.8 68.169
1958 TX 122.43 444.546 468.1 263.7 66.513
1959 TX 132.94 482.704 381.3 255.2 68.655
1960 TX 138.41 502.601 393.1 251.4 69.564
1961 TX 142.7 518.173 480.6 257.2 69.331
1962 TX 152.82 554.894 400.7 282.7 70.551

SQL Call

The function calculates the correlation between each pair of columns in the TargetColumns argument. This example compares GDP to GDPdeflator, the employed population to GDP, the number of people unemployed, and the number of people in the armed forces.

SELECT * FROM CorrelationReduce (
  ON CorrelationMap (
    ON corr_input PARTITION BY state
    USING 
    TargetColumns('[2:3]', 'employed:gdp', 'employed:unemployed', 'employed:armedforces')
    KeyName ('keys')
    GroupByColumns ('state')
  ) PARTITION BY keys, state
) AS dt;

Output

Because GDP and GDPdeflator represent the same data but with different scaling, their correlation is 1. The correlation coefficients for all column pairs are shown below.

state corr value_col
CA gdp:gdp 1
CA gdpdeflator:gdpdeflator 1
CA gdpdeflator:gdp 1
CA employed:gdp 0.967695
CA employed:armedforces 0.952826
CA employed:unemployed -0.437618
TX gdp:gdp 1
TX gdpdeflator:gdpdeflator 1
TX gdpdeflator:gdp 1
TX employed:gdp 0.912757
TX employed:unemployed 0.32077
TX employed:armedforces -0.451985