Correlation Example 1: Include PARTITION BY Clause

Correlation Example 1: Include PARTITION BY Clause - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.00

1.0

Published

May 2019

Language

English (United States)

Last Update

2019-11-22

dita:mapPath

blj1506016597986.ditamap

dita:ditavalPath

blj1506016597986.ditaval

dita:id

B700-4003

lifecycle

Product Category

Teradata Vantage™

This example uses the PARTITION BY clause to group the data by state. The function calculates the correlations between columns separately for California (CA) and Texas (TX) data.

Input

The input table, corr_input, is sample macroeconomic data for the states of California and Texas over a period of 16 years (1947-1962). The GDP (gross domestic product) numbers are in millions of dollars ($M). GDPdeflator is GDP data normalized to the year 1954 that is, GDP is 100 for 1954). The other columns represent the number of people (in thousands) who were employed, unemployed, or in the armed forces.

corr_input
year	state	gdpdeflator	gdp	unemployed	armedforces	employed
1947	CA	64.52	234.289	235.6	159	60.323
1948	CA	71.45	259.426	232.5	145.6	61.122
1949	CA	71.07	258.054	368.2	161.6	60.171
1950	CA	78.38	284.599	335.1	165	61.187
1951	CA	90.6	328.975	209.9	309.9	63.221
1952	CA	95.56	346.999	193.2	359.4	63.639
1953	CA	100.63	365.385	187	354.7	64.989
1954	CA	100	363.112	357.8	335	63.761
1955	TX	109.46	397.469	290.4	304.8	66.019
1956	TX	115.44	419.18	282.2	285.7	67.857
1957	TX	121.94	442.769	293.6	279.8	68.169
1958	TX	122.43	444.546	468.1	263.7	66.513
1959	TX	132.94	482.704	381.3	255.2	68.655
1960	TX	138.41	502.601	393.1	251.4	69.564
1961	TX	142.7	518.173	480.6	257.2	69.331
1962	TX	152.82	554.894	400.7	282.7	70.551

SQL Call

The function calculates the correlation between each pair of columns in the TargetColumns argument. This example compares GDP to GDPdeflator, the employed population to GDP, the number of people unemployed, and the number of people in the armed forces.

SELECT * FROM CorrelationReduce (
  ON CorrelationMap (
    ON corr_input PARTITION BY state
    USING 
    TargetColumns('[2:3]', 'employed:gdp', 'employed:unemployed', 'employed:armedforces')
    KeyName ('keys')
    GroupByColumns ('state')
  ) PARTITION BY keys, state
) AS dt;

Output

Because GDP and GDPdeflator represent the same data but with different scaling, their correlation is 1. The correlation coefficients for all column pairs are shown below.

state	corr	value_col
CA	gdp:gdp	1
CA	gdpdeflator:gdpdeflator	1
CA	gdpdeflator:gdp	1
CA	employed:gdp	0.967695
CA	employed:armedforces	0.952826
CA	employed:unemployed	-0.437618
TX	gdp:gdp	1
TX	gdpdeflator:gdpdeflator	1
TX	gdpdeflator:gdp	1
TX	employed:gdp	0.912757
TX	employed:unemployed	0.32077
TX	employed:armedforces	-0.451985