GLM Example 1: Logistic Regression Analysis with Intercept

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.0, 8.00
Release Date: May 2019
Content Type: Programming Reference
Publication ID: B700-4003-098K
Language: English (United States)

In logistic regression, the dependent variable (Y) has only two possible values (for example, 0 and 1, 'yes' and 'no', or 'true' and 'false'). The algorithm fits a model to the data, and the model predicts the more likely of the two outcomes.
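
For reference, the model behind this description can be written out. The notation below is the standard textbook form of logistic regression and is not taken from this page: $\beta_0$ is the intercept and $\beta_1, \dots, \beta_k$ are the coefficients of the $k$ predictor terms.

```latex
\[
P(Y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}},
\qquad
\log \frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)}
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k .
\]
```

The second form is the logit link: the log-odds of the outcome are linear in the predictors. Under the common convention of a 0.5 cutoff (an assumption here), the predicted outcome is 1 whenever the linear term is positive.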

Input

The input table, admissions_train, contains data about applicants to an academic program. For each applicant, the table records a Masters Degree indicator (masters), a grade point average on a 4.0 scale (gpa), a statistical skills indicator (stats), a programming skills indicator (programming), and whether the applicant was admitted (admitted). The masters, stats, and programming columns are categorical: masters has two categories (yes and no), and stats and programming each have three (Novice, Beginner, and Advanced). In the admitted column, 1 indicates that the applicant was admitted and 0 indicates otherwise.

admissions_train
id masters gpa stats programming admitted
1 yes 3.95 Beginner Beginner 0
2 yes 3.76 Beginner Beginner 0
3 no 3.7 Novice Beginner 1
4 yes 3.5 Beginner Novice 1
5 no 3.44 Novice Novice 0
6 yes 3.5 Beginner Advanced 1
7 yes 2.33 Novice Novice 1
8 no 3.6 Beginner Advanced 1
9 no 3.82 Advanced Advanced 1
10 no 3.71 Advanced Advanced 1
11 no 3.13 Advanced Advanced 1
12 no 3.65 Novice Novice 1
13 no 4 Advanced Novice 1
14 yes 3.45 Advanced Advanced 0
15 yes 4 Advanced Advanced 1
16 no 3.7 Advanced Advanced 1
17 no 3.83 Advanced Advanced 1
18 yes 3.81 Advanced Advanced 1
19 yes 1.98 Advanced Advanced 0
20 yes 3.9 Advanced Advanced 1
21 no 3.87 Novice Beginner 1
22 yes 3.46 Novice Beginner 0
23 yes 3.59 Advanced Novice 1
24 no 1.87 Advanced Novice 1
25 no 3.96 Advanced Advanced 1
26 yes 3.57 Advanced Advanced 1
27 yes 3.96 Advanced Advanced 0
28 no 3.93 Advanced Advanced 1
29 yes 4 Novice Beginner 0
30 yes 3.79 Advanced Novice 0
31 yes 3.5 Advanced Beginner 1
32 yes 3.46 Advanced Beginner 0
33 no 3.55 Novice Novice 1
34 yes 3.85 Advanced Beginner 0
35 no 3.68 Novice Beginner 1
36 no 3 Advanced Novice 0
37 no 3.52 Novice Novice 1
38 yes 2.65 Advanced Beginner 1
39 yes 3.75 Advanced Beginner 0
40 yes 3.95 Novice Beginner 0
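
Before fitting, it can help to see how the response is distributed across a categorical predictor. The query below is a minimal sketch in plain SQL, assuming admissions_train can be queried directly with the column names shown above.

```sql
-- Count applicants by Masters Degree indicator and admission outcome.
SELECT masters,
       admitted,
       COUNT(*) AS applicants
FROM admissions_train
GROUP BY masters, admitted
ORDER BY masters, admitted;
```

For the 40 rows listed above, the counts are 2 not admitted and 16 admitted for masters = 'no', and 12 not admitted and 10 admitted for masters = 'yes'.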

SQL Call

By default, the function includes the intercept and the Step argument is 'false'; the call below sets both arguments explicitly. The response variable (admitted, in this example) must be the first column listed in the InputColumns argument, followed by the predictors.

DROP TABLE glm_admissions_model;

SELECT * FROM GLM (
  ON admissions_train AS InputTable
  OUT TABLE OutputTable (glm_admissions_model)
  USING
  InputColumns ('admitted', 'masters', 'gpa', 'stats', 'programming')
  CategoricalColumns ('masters', 'stats', 'programming')
  Family ('LOGISTIC')
  LinkFunction ('LOGIT')
  WeightColumn ('1')
  StopThreshold (0.01)
  MaxIterNum (25)
  Step ('false')
  Intercept ('true')
) AS dt;

Output

The output table shows the model statistics.

Model Statistics
predictor                estimate   std_error  z_score    p_value    significance
(Intercept)              1.07751    2.92076    0.368914   0.712192
masters.no               2.21655    1.01999    2.17311    0.0297719  *
gpa                      -0.113935  0.802573   -0.141962  0.88711
stats.Novice             0.0406848  1.11567    0.0364667  0.97091
stats.Beginner           0.526618   1.2229     0.430631   0.666736
programming.Beginner     -1.76976   1.069      -1.65553   0.0978177  .
programming.Novice       -0.98035   1.14004    -0.859923  0.389831
ITERATIONS #             4          0          0          0          Number of Fisher Scoring iterations
ROWS #                   40         0          0          0          Number of rows
Residual deviance        38.9038    0          0          0          on 33 degrees of freedom
Pearson goodness of fit  37.7905    0          0          0          on 33 degrees of freedom
AIC                      52.9038    0          0          0          Akaike information criterion
BIC                      64.726     0          0          0          Bayesian information criterion
Wald Test                9.89642    0          0          0.19452
Dispersion parameter     1          0          0          0          Taken to be 1 for BINOMIAL and POISSON.
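
The z_score and p_value columns can be checked by hand. The sketch below assumes the usual Wald construction, which this page does not state explicitly: the z-score is the estimate divided by its standard error, and the p-value is two-sided under the standard normal distribution with CDF $\Phi$. Worked for the masters.no row:

```latex
\[
z = \frac{\hat{\beta}}{\operatorname{SE}(\hat{\beta})}
  = \frac{2.21655}{1.01999} \approx 2.1731,
\qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr) \approx 0.0298 .
\]
```

Both values agree with the table. The p-value falls below 0.05, which is consistent with the * marker if the significance codes follow the common convention (an assumption; the page gives no legend).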

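For each categorical variable, the model selects one category as the reference; that category has no coefficient of its own, and the estimates for the other categories are relative to it. In this example, the reference categories are yes for masters and Advanced for both stats and programming.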
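
To see how reference categories enter a prediction, the estimates from the Model Statistics table can be applied by hand. The query below is an illustrative sketch, not the function's own scoring path: it hard-codes the estimates shown above, adds nothing for categories that have no coefficient row, and applies the inverse logit. LOWER() is used only so that the string comparisons do not depend on the engine's case sensitivity.

```sql
-- Hand-computed admission probability from the fitted coefficients.
-- Categories without a coefficient row contribute 0 to the linear predictor.
SELECT id,
       admitted,
       1.0 / (1.0 + EXP(-(
           1.07751                                        -- (Intercept)
         + CASE WHEN LOWER(masters) = 'no' THEN 2.21655   -- masters.no
                ELSE 0 END
         - 0.113935 * gpa                                 -- gpa
         + CASE WHEN LOWER(stats) = 'novice'   THEN 0.0406848
                WHEN LOWER(stats) = 'beginner' THEN 0.526618
                ELSE 0 END
         + CASE WHEN LOWER(programming) = 'beginner' THEN -1.76976
                WHEN LOWER(programming) = 'novice'   THEN -0.98035
                ELSE 0 END
       ))) AS prob_admitted
FROM admissions_train
ORDER BY id;
```

For applicant 3 (masters = no, gpa = 3.7, stats = Novice, programming = Beginner), the linear predictor is 1.07751 + 2.21655 - 0.113935 × 3.7 + 0.0406848 - 1.76976 ≈ 1.1434, which corresponds to a probability of about 0.76.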

The following query returns the model table:

SELECT * FROM glm_admissions_model ORDER BY attribute;
glm_admissions_model
attribute  predictor    category  estimate      std_err      z_score       p_value      significance  family
-1         Loglik                 -19.45190048  40           6             0                          LOGISTIC
0          (Intercept)            1.077509999   2.920759916  0.368914008   0.712191999                LOGISTIC
1          masters      yes                                                                           LOGISTIC
2          masters      no        2.216550112   1.019989967  2.173110008   0.0297719    *             LOGISTIC
3          gpa                    -0.113935001  0.802573025  -0.141962007  0.887109995                LOGISTIC
4          stats        advanced                                                                      LOGISTIC
5          stats        novice    0.040684801   1.115669966  0.036466699   0.970910013                LOGISTIC
6          stats        beginner  0.526618004   1.222900033  0.430631012   0.666736007                LOGISTIC
7          programming  advanced                                                                      LOGISTIC
8          programming  beginner  -1.769760013  1.069000006  -1.655529976  0.097817697  .             LOGISTIC
9          programming  novice    -0.980350018  1.14004004   -0.859923     0.389831007                LOGISTIC
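
The summary statistics reported in the Output section can be reproduced from the Loglik row. The query below is a sketch under two assumptions inferred from the values shown (40 and 6), not from a documented schema: in the Loglik row, std_err holds the number of rows and z_score holds the number of predictor terms, so the parameter count including the intercept is z_score + 1. It uses the standard definitions: deviance = -2 × loglik, AIC = deviance + 2k, and BIC = deviance + k × ln(n).

```sql
-- Recompute residual deviance, AIC, and BIC from the log-likelihood row.
SELECT -2 * estimate                               AS residual_deviance,
       -2 * estimate + 2 * (z_score + 1)           AS aic,
       -2 * estimate + (z_score + 1) * LN(std_err) AS bic
FROM glm_admissions_model
WHERE attribute = -1;
```

With loglik = -19.45190048, n = 40, and k = 7, these expressions evaluate to approximately 38.9038, 52.9038, and 64.726, matching the Residual deviance, AIC, and BIC rows of the Model Statistics table.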