This example uses the PARTITION BY clause to group the data by state. The function calculates the correlations between columns separately for California (CA) and Texas (TX) data.
Input
The input table, corr_input, is sample macroeconomic data for the states of California and Texas over a period of 16 years (1947-1962). The GDP (gross domestic product) numbers are in millions of dollars ($M). GDPdeflator is GDP data normalized to the year 1954 that is, GDP is 100 for 1954). The other columns represent the number of people (in thousands) who were employed, unemployed, or in the armed forces.
year | state | gdpdeflator | gdp | unemployed | armedforces | employed |
---|---|---|---|---|---|---|
1947 | CA | 64.52 | 234.289 | 235.6 | 159 | 60.323 |
1948 | CA | 71.45 | 259.426 | 232.5 | 145.6 | 61.122 |
1949 | CA | 71.07 | 258.054 | 368.2 | 161.6 | 60.171 |
1950 | CA | 78.38 | 284.599 | 335.1 | 165 | 61.187 |
1951 | CA | 90.6 | 328.975 | 209.9 | 309.9 | 63.221 |
1952 | CA | 95.56 | 346.999 | 193.2 | 359.4 | 63.639 |
1953 | CA | 100.63 | 365.385 | 187 | 354.7 | 64.989 |
1954 | CA | 100 | 363.112 | 357.8 | 335 | 63.761 |
1955 | TX | 109.46 | 397.469 | 290.4 | 304.8 | 66.019 |
1956 | TX | 115.44 | 419.18 | 282.2 | 285.7 | 67.857 |
1957 | TX | 121.94 | 442.769 | 293.6 | 279.8 | 68.169 |
1958 | TX | 122.43 | 444.546 | 468.1 | 263.7 | 66.513 |
1959 | TX | 132.94 | 482.704 | 381.3 | 255.2 | 68.655 |
1960 | TX | 138.41 | 502.601 | 393.1 | 251.4 | 69.564 |
1961 | TX | 142.7 | 518.173 | 480.6 | 257.2 | 69.331 |
1962 | TX | 152.82 | 554.894 | 400.7 | 282.7 | 70.551 |
SQL Call
The function calculates the correlation between each pair of columns in the TargetColumns argument. This example compares GDP to GDPdeflator, the employed population to GDP, the number of people unemployed, and the number of people in the armed forces.
SELECT * FROM CorrelationReduce ( ON CorrelationMap ( ON corr_input PARTITION BY state USING TargetColumns('[2:3]', 'employed:gdp', 'employed:unemployed', 'employed:armedforces') KeyName ('keys') GroupByColumns ('state') ) PARTITION BY keys, state ) AS dt;
Output
Because GDP and GDPdeflator represent the same data but with different scaling, their correlation is 1. The correlation coefficients for all column pairs are shown below.
state | corr | value_col |
---|---|---|
CA | gdp:gdp | 1 |
CA | gdpdeflator:gdpdeflator | 1 |
CA | gdpdeflator:gdp | 1 |
CA | employed:gdp | 0.967695 |
CA | employed:armedforces | 0.952826 |
CA | employed:unemployed | -0.437618 |
TX | gdp:gdp | 1 |
TX | gdpdeflator:gdpdeflator | 1 |
TX | gdpdeflator:gdp | 1 |
TX | employed:gdp | 0.912757 |
TX | employed:unemployed | 0.32077 |
TX | employed:armedforces | -0.451985 |