In logistic regression, the dependent variable (Y) has only two possible values (for example, 0 and 1, 'yes' and 'no', or 'true' and 'false'). The algorithm fits the model to the data and predicts the most likely outcome for each observation.
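Concretely, logistic regression maps a linear predictor x·β to a probability through the logistic (inverse-logit) function, and the outcome with probability above 0.5 is the predicted class. A minimal sketch of the link function (standard math, not specific to this function's implementation):

```python
import math

def inverse_logit(eta):
    """Map a linear predictor eta = x . beta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# A linear predictor of 0 corresponds to even odds.
print(inverse_logit(0.0))             # 0.5
# Large positive predictors saturate toward 1.
print(round(inverse_logit(4.0), 3))   # 0.982
```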
Input
The input table, admissions_train, contains data about applicants to an academic program. For each applicant, the table records a masters degree indicator, a grade point average (on a 4.0 scale), a statistical skills indicator, a programming skills indicator, and an indicator of whether the applicant was admitted. The masters, stats, and programming indicators are categorical variables: masters has two categories (yes and no), while stats and programming each have three (Novice, Beginner, and Advanced). In the admitted column, 1 indicates that the applicant was admitted and 0 indicates otherwise.
id | masters | gpa | stats | programming | admitted |
---|---|---|---|---|---|
1 | yes | 3.95 | Beginner | Beginner | 0 |
2 | yes | 3.76 | Beginner | Beginner | 0 |
3 | no | 3.7 | Novice | Beginner | 1 |
4 | yes | 3.5 | Beginner | Novice | 1 |
5 | no | 3.44 | Novice | Novice | 0 |
6 | yes | 3.5 | Beginner | Advanced | 1 |
7 | yes | 2.33 | Novice | Novice | 1 |
8 | no | 3.6 | Beginner | Advanced | 1 |
9 | no | 3.82 | Advanced | Advanced | 1 |
10 | no | 3.71 | Advanced | Advanced | 1 |
11 | no | 3.13 | Advanced | Advanced | 1 |
12 | no | 3.65 | Novice | Novice | 1 |
13 | no | 4 | Advanced | Novice | 1 |
14 | yes | 3.45 | Advanced | Advanced | 0 |
15 | yes | 4 | Advanced | Advanced | 1 |
16 | no | 3.7 | Advanced | Advanced | 1 |
17 | no | 3.83 | Advanced | Advanced | 1 |
18 | yes | 3.81 | Advanced | Advanced | 1 |
19 | yes | 1.98 | Advanced | Advanced | 0 |
20 | yes | 3.9 | Advanced | Advanced | 1 |
21 | no | 3.87 | Novice | Beginner | 1 |
22 | yes | 3.46 | Novice | Beginner | 0 |
23 | yes | 3.59 | Advanced | Novice | 1 |
24 | no | 1.87 | Advanced | Novice | 1 |
25 | no | 3.96 | Advanced | Advanced | 1 |
26 | yes | 3.57 | Advanced | Advanced | 1 |
27 | yes | 3.96 | Advanced | Advanced | 0 |
28 | no | 3.93 | Advanced | Advanced | 1 |
29 | yes | 4 | Novice | Beginner | 0 |
30 | yes | 3.79 | Advanced | Novice | 0 |
31 | yes | 3.5 | Advanced | Beginner | 1 |
32 | yes | 3.46 | Advanced | Beginner | 0 |
33 | no | 3.55 | Novice | Novice | 1 |
34 | yes | 3.85 | Advanced | Beginner | 0 |
35 | no | 3.68 | Novice | Beginner | 1 |
36 | no | 3 | Advanced | Novice | 0 |
37 | no | 3.52 | Novice | Novice | 1 |
38 | yes | 2.65 | Advanced | Beginner | 1 |
39 | yes | 3.75 | Advanced | Beginner | 0 |
40 | yes | 3.95 | Novice | Beginner | 0 |
SQL Call
By default, the model includes an intercept (Intercept set to 'true') and stepwise selection is disabled (Step set to 'false'). The response variable (admitted, in this example) must be the first variable listed in the InputColumns argument, followed by the predictors.
DROP TABLE glm_admissions_model;

SELECT * FROM GLM (
  ON admissions_train AS InputTable
  OUT TABLE OutputTable (glm_admissions_model)
  USING
  InputColumns ('admitted', 'masters', 'gpa', 'stats', 'programming')
  CategoricalColumns ('masters', 'stats', 'programming')
  Family ('LOGISTIC')
  LinkFunction ('LOGIT')
  WeightColumn ('1')
  StopThreshold (0.01)
  MaxIterNum (25)
  Step ('false')
  Intercept ('true')
) AS dt;
Output
The output table shows the model statistics.
predictor | estimate | std_error | z_score | p_value | significance |
---|---|---|---|---|---|
(Intercept) | 1.07751 | 2.92076 | 0.368914 | 0.712192 | |
masters.no | 2.21655 | 1.01999 | 2.17311 | 0.0297719 | * |
gpa | -0.113935 | 0.802573 | -0.141962 | 0.88711 | |
stats.Novice | 0.0406848 | 1.11567 | 0.0364667 | 0.97091 | |
stats.Beginner | 0.526618 | 1.2229 | 0.430631 | 0.666736 | |
programming.Beginner | -1.76976 | 1.069 | -1.65553 | 0.0978177 | . |
programming.Novice | -0.98035 | 1.14004 | -0.859923 | 0.389831 | |
ITERATIONS # | 4 | 0 | 0 | 0 | Number of Fisher Scoring iterations |
ROWS # | 40 | 0 | 0 | 0 | Number of rows |
Residual deviance | 38.9038 | 0 | 0 | 0 | on 33 degrees of freedom |
Pearson goodness of fit | 37.7905 | 0 | 0 | 0 | on 33 degrees of freedom |
AIC | 52.9038 | 0 | 0 | 0 | Akaike information criterion |
BIC | 64.726 | 0 | 0 | 0 | Bayesian information criterion |
Wald Test | 9.89642 | 0 | 0 | 0.19452 | |
Dispersion parameter | 1 | 0 | 0 | 0 | Taken to be 1 for BINOMIAL and POISSON. |
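The information criteria in this table can be reproduced from the model's log-likelihood (reported as the Loglik row of the model table, about -19.4519) using the standard formulas AIC = -2·loglik + 2k and BIC = -2·loglik + k·ln(n), where n = 40 rows and k = 7 estimated parameters (the intercept plus six coefficients). A quick check with the values from this output:

```python
import math

loglik = -19.45190048   # log-likelihood reported in the model table
n, k = 40, 7            # rows; parameters (intercept + 6 coefficients)

deviance = -2 * loglik              # residual deviance
aic = deviance + 2 * k              # Akaike information criterion
bic = deviance + k * math.log(n)    # Bayesian information criterion

print(round(deviance, 4))  # 38.9038
print(round(aic, 4))       # 52.9038
print(round(bic, 4))       # 64.726
```

These match the Residual deviance, AIC, and BIC rows above, and the residual degrees of freedom are n - k = 33.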
For each categorical variable, the model selects a reference category, whose coefficient is implicitly zero. In this example, the reference categories are yes for masters and Advanced for both stats and programming.
This query returns the following table:
SELECT * FROM glm_admissions_model ORDER BY attribute;
attribute | predictor | category | estimate | std_err | z_score | p_value | significance | family |
---|---|---|---|---|---|---|---|---|
-1 | Loglik | | -19.45190048 | 40 | 6 | 0 | | LOGISTIC |
0 | (Intercept) | | 1.077509999 | 2.920759916 | 0.368914008 | 0.712191999 | | LOGISTIC |
1 | masters | yes | | | | | | LOGISTIC |
2 | masters | no | 2.216550112 | 1.019989967 | 2.173110008 | 0.0297719 | * | LOGISTIC |
3 | gpa | | -0.113935001 | 0.802573025 | -0.141962007 | 0.887109995 | | LOGISTIC |
4 | stats | advanced | | | | | | LOGISTIC |
5 | stats | novice | 0.040684801 | 1.115669966 | 0.036466699 | 0.970910013 | | LOGISTIC |
6 | stats | beginner | 0.526618004 | 1.222900033 | 0.430631012 | 0.666736007 | | LOGISTIC |
7 | programming | advanced | | | | | | LOGISTIC |
8 | programming | beginner | -1.769760013 | 1.069000006 | -1.655529976 | 0.097817697 | . | LOGISTIC |
9 | programming | novice | -0.980350018 | 1.14004004 | -0.859923 | 0.389831007 | | LOGISTIC |
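Because the coefficients are on the log-odds scale, exponentiating an estimate gives an odds ratio relative to the reference category (or per unit change, for the continuous gpa predictor). For example, exp(2.21655) is about 9.18, so, holding the other predictors fixed, applicants without a masters degree have roughly nine times the odds of admission of those with one. A sketch using the estimates above:

```python
import math

# Coefficient estimates from the model table (log-odds scale).
estimates = {
    "masters.no": 2.21655,
    "gpa": -0.113935,
    "stats.Novice": 0.0406848,
    "stats.Beginner": 0.526618,
    "programming.Beginner": -1.76976,
    "programming.Novice": -0.98035,
}

# Odds ratio relative to the reference category (or per gpa point).
odds_ratios = {name: math.exp(b) for name, b in estimates.items()}
print(round(odds_ratios["masters.no"], 2))   # 9.18
print(round(odds_ratios["gpa"], 2))          # 0.89, per additional gpa point
```

Note that only masters.no is significant at the 0.05 level (p = 0.0298), so the other odds ratios should be interpreted with caution.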