GLM Example: Logistic Regression Analysis with Intercept - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
Product Category
Teradata Vantage™

In logistic regression, the dependent variable (Y) has only two possible values (for example, 0 and 1, 'yes' and 'no', or 'true' and 'false'). The algorithm fits the model to the data and predicts the most likely of the two outcomes for each observation.
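
With the LOGISTIC family and LOGIT link used in the SQL call below, the fitted model takes the standard logistic-regression form, shown here for reference (this equation is background material, not part of the original output):

  P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}},
  \qquad
  \operatorname{logit}(p) = \ln\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k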

Input

The InputTable, admissions_train, contains data about applicants to an academic program. For each applicant, the table records a masters degree indicator, a grade point average (on a 4.0 scale), a statistics skills indicator, a programming skills indicator, and an indicator of whether the applicant was admitted. The masters, stats, and programming indicators are categorical variables: masters has two categories (yes and no), while stats and programming each have three categories (Novice, Beginner, and Advanced). In the admitted column, 1 indicates that the student was admitted and 0 indicates otherwise.

InputTable: admissions_train
 id masters gpa  stats    programming admitted
 -- ------- ---- -------- ----------- --------
  1 yes     3.95 Beginner Beginner    0
  2 yes     3.76 Beginner Beginner    0
  3 no      3.7  Novice   Beginner    1
  4 yes     3.5  Beginner Novice      1
  5 no      3.44 Novice   Novice      0
  6 yes     3.5  Beginner Advanced    1
  7 yes     2.33 Novice   Novice      1
  8 no      3.6  Beginner Advanced    1
  9 no      3.82 Advanced Advanced    1
 10 no      3.71 Advanced Advanced    1
 11 no      3.13 Advanced Advanced    1
 12 no      3.65 Novice   Novice      1
 13 no      4    Advanced Novice      1
 14 yes     3.45 Advanced Advanced    0
 15 yes     4    Advanced Advanced    1
 16 no      3.7  Advanced Advanced    1
 17 no      3.83 Advanced Advanced    1
 18 yes     3.81 Advanced Advanced    1
 19 yes     1.98 Advanced Advanced    0
 20 yes     3.9  Advanced Advanced    1
 21 no      3.87 Novice   Beginner    1
 22 yes     3.46 Novice   Beginner    0
 23 yes     3.59 Advanced Novice      1
 24 no      1.87 Advanced Novice      1
 25 no      3.96 Advanced Advanced    1
 26 yes     3.57 Advanced Advanced    1
 27 yes     3.96 Advanced Advanced    0
 28 no      3.93 Advanced Advanced    1
 29 yes     4    Novice   Beginner    0
 30 yes     3.79 Advanced Novice      0
 31 yes     3.5  Advanced Beginner    1
 32 yes     3.46 Advanced Beginner    0
 33 no      3.55 Novice   Novice      1
 34 yes     3.85 Advanced Beginner    0
 35 no      3.68 Novice   Beginner    1
 36 no      3    Advanced Novice      0
 37 no      3.52 Novice   Novice      1
 38 yes     2.65 Advanced Beginner    1
 39 yes     3.75 Advanced Beginner    0
 40 yes     3.95 Novice   Beginner    0
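
To reproduce the example, the input table can be created with a DDL along the following lines. The column types and the primary index shown here are assumptions for illustration; the example itself does not specify them.

CREATE MULTISET TABLE admissions_train (
  id          INTEGER,      -- applicant identifier
  masters     VARCHAR(5),   -- 'yes' or 'no'
  gpa         FLOAT,        -- grade point average on a 4.0 scale
  stats       VARCHAR(10),  -- 'Novice', 'Beginner', or 'Advanced'
  programming VARCHAR(10),  -- 'Novice', 'Beginner', or 'Advanced'
  admitted    INTEGER       -- 1 = admitted, 0 = not admitted
) PRIMARY INDEX (id);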

SQL Call

The response variable (admitted, in this example) must be specified as the first variable listed in the TargetColumns syntax element, followed by the other predictors.

DROP TABLE glm_admissions_model;

SELECT * FROM GLM (
  ON admissions_train AS InputTable
  OUT TABLE OutputTable (glm_admissions_model)
  USING
  TargetColumns ('admitted','masters', 'gpa', 'stats', 'programming')
  CategoricalColumns ('masters', 'stats', 'programming')
  Family ('LOGISTIC')
  LinkFunction ('LOGIT')
  WeightColumn ('1')
  StopThreshold (0.01)
  MaxIterNum (25)
  Intercept ('true')
) AS dt;
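
Once the model table exists, new applicants are typically scored with the companion GLMPredict function. The call below is only an illustrative sketch: the admissions_test table is hypothetical, and the argument list is approximate, so check the GLMPredict section of this reference for the exact syntax in your release.

SELECT * FROM GLMPredict (
  ON admissions_test PARTITION BY ANY     -- hypothetical table of new applicants
  ON glm_admissions_model AS ModelTable DIMENSION
  USING
  Accumulate ('id')
  Family ('LOGISTIC')
  LinkFunction ('LOGIT')
) AS dt ORDER BY id;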

Output

The output table shows the model statistics.

 predictor               estimate             std_error          z_score              p_value              significance                            
 ----------------------- -------------------- ------------------ -------------------- -------------------- --------------------------------------- 
 (Intercept)               1.0775099992752075  2.920759916305542  0.36891400814056396   0.7121919989585876                                        
 masters.no                  2.21655011177063 1.0199899673461914    2.173110008239746 0.029771899804472923 *                                      
 gpa                     -0.11393500119447708  0.802573025226593 -0.14196200668811798   0.8871099948883057                                        
 stats.novice             0.04068480059504509 1.1156699657440186 0.036466699093580246   0.9709100127220154                                        
 stats.beginner            0.5266180038452148 1.2229000329971313  0.43063101172447205   0.6667360067367554                                        
 programming.beginner      -1.769760012626648  1.069000005722046  -1.6555299758911133  0.09781769663095474 .                                      
 programming.novice       -0.9803500175476074 1.1400400400161743  -0.8599230051040649    0.389831006526947                                        
 ITERATIONS #                             4.0                0.0                  0.0                  0.0 Number of Fisher Scoring iterations    
 ROWS #                                  40.0                0.0                  0.0                  0.0 Number of rows                         
 Residual deviance          38.90380096435547                0.0                  0.0                  0.0 on 33 degrees of freedom               
 Pearson goodness of fit    37.79050064086914                0.0                  0.0                  0.0 on 33 degrees of freedom               
 AIC                        52.90380096435547                0.0                  0.0                  0.0 Akaike information criterion           
 BIC                        64.72595977783203                0.0                  0.0                  0.0 Bayesian information criterion         
 Wald Test                  9.896419525146484                0.0                  0.0  0.19451963901519775                                        
 Dispersion parameter                     1.0                0.0                  0.0                  0.0 Taken to be 1 for BINOMIAL and POISSON.

For each categorical variable, the model selects a reference category, whose effect is absorbed into the intercept and whose row in the model table therefore has NULL estimates. In this example, yes is the reference category for masters, and Advanced is the reference category for both stats and programming.

This query returns the following table:

SELECT * FROM glm_admissions_model;
 attribute predictor   category estimate             std_err            z_score              p_value              significance family   
 --------- ----------- -------- -------------------- ------------------ -------------------- -------------------- ------------ -------- 
        -1 Loglik      NULL      -19.451900482177734               40.0                  6.0                  0.0 NULL         LOGISTIC
         0 (Intercept) NULL       1.0775099992752075  2.920759916305542  0.36891400814056396   0.7121919989585876              LOGISTIC
         1 masters     yes                      NULL               NULL                 NULL                 NULL NULL         LOGISTIC
         2 masters     no           2.21655011177063 1.0199899673461914    2.173110008239746 0.029771899804472923 *            LOGISTIC
         3 gpa         NULL     -0.11393500119447708  0.802573025226593 -0.14196200668811798   0.8871099948883057              LOGISTIC
         4 stats       advanced                 NULL               NULL                 NULL                 NULL NULL         LOGISTIC
         5 stats       novice    0.04068480059504509 1.1156699657440186 0.036466699093580246   0.9709100127220154              LOGISTIC
         6 stats       beginner   0.5266180038452148 1.2229000329971313  0.43063101172447205   0.6667360067367554              LOGISTIC
         7 programming advanced                 NULL               NULL                 NULL                 NULL NULL         LOGISTIC
         8 programming beginner   -1.769760012626648  1.069000005722046  -1.6555299758911133  0.09781769663095474 .            LOGISTIC
         9 programming novice    -0.9803500175476074 1.1400400400161743  -0.8599230051040649    0.389831006526947              LOGISTIC
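
As a sanity check, you can compute a predicted probability by hand from the model table. The query below scores applicant id 3 from the input table (no masters degree, gpa 3.7, stats Novice, programming Beginner), using the rounded coefficient estimates above and the inverse of the LOGIT link:

SELECT 1.0 / (1.0 + EXP( - (
           1.07751              /* (Intercept)          */
         + 2.21655              /* masters.no           */
         + 3.7 * (-0.113935)    /* gpa = 3.7            */
         + 0.0406848            /* stats.novice         */
         + (-1.76976)           /* programming.beginner */
       ))) AS predicted_probability;

The result is approximately 0.76, consistent with the observed admitted value of 1 for that applicant. The reference categories (masters yes, stats Advanced, programming Advanced) contribute nothing to the linear predictor.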
