GLM Example 1: Logistic Regression Analysis with Intercept - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
Product Category
Teradata Vantage™

In logistic regression, the dependent variable (Y) has only two possible values (for example, 0 and 1, 'yes' and 'no', or 'true' and 'false'). The algorithm fits a model to the data and uses it to predict the more likely of the two outcomes for each observation.
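
For reference (this formula does not appear in the original example), a logistic GLM with the LOGIT link models the log-odds of the positive outcome as a linear combination of the predictors:

\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}

where p is the probability that Y = 1 and \beta_0 is the intercept that this example includes.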

Input

The input table, admissions_train, contains data about applicants to an academic program. For each applicant, the table includes a Masters Degree indicator, a grade point average (on a 4.0 scale), a statistical skills indicator, a programming skills indicator, and an indicator of whether the applicant was admitted. The Masters Degree, statistical skills, and programming skills indicators are categorical variables. The Masters Degree indicator has two categories (yes or no), while the other two have three categories (Novice, Beginner, and Advanced). For admitted status, 1 indicates that the student was admitted and 0 indicates otherwise.

admissions_train
id masters gpa stats programming admitted
1 yes 3.95 Beginner Beginner 0
2 yes 3.76 Beginner Beginner 0
3 no 3.7 Novice Beginner 1
4 yes 3.5 Beginner Novice 1
5 no 3.44 Novice Novice 0
6 yes 3.5 Beginner Advanced 1
7 yes 2.33 Novice Novice 1
8 no 3.6 Beginner Advanced 1
9 no 3.82 Advanced Advanced 1
10 no 3.71 Advanced Advanced 1
11 no 3.13 Advanced Advanced 1
12 no 3.65 Novice Novice 1
13 no 4 Advanced Novice 1
14 yes 3.45 Advanced Advanced 0
15 yes 4 Advanced Advanced 1
16 no 3.7 Advanced Advanced 1
17 no 3.83 Advanced Advanced 1
18 yes 3.81 Advanced Advanced 1
19 yes 1.98 Advanced Advanced 0
20 yes 3.9 Advanced Advanced 1
21 no 3.87 Novice Beginner 1
22 yes 3.46 Novice Beginner 0
23 yes 3.59 Advanced Novice 1
24 no 1.87 Advanced Novice 1
25 no 3.96 Advanced Advanced 1
26 yes 3.57 Advanced Advanced 1
27 yes 3.96 Advanced Advanced 0
28 no 3.93 Advanced Advanced 1
29 yes 4 Novice Beginner 0
30 yes 3.79 Advanced Novice 0
31 yes 3.5 Advanced Beginner 1
32 yes 3.46 Advanced Beginner 0
33 no 3.55 Novice Novice 1
34 yes 3.85 Advanced Beginner 0
35 no 3.68 Novice Beginner 1
36 no 3 Advanced Novice 0
37 no 3.52 Novice Novice 1
38 yes 2.65 Advanced Beginner 1
39 yes 3.75 Advanced Beginner 0
40 yes 3.95 Novice Beginner 0
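
To reproduce the example, you can create and load the input table yourself. The following DDL is a minimal sketch rather than the official definition: the column names match the listing above, but the data types are assumptions, and only the first three of the 40 INSERT statements are shown.

CREATE MULTISET TABLE admissions_train (
  id          INTEGER,
  masters     VARCHAR(5),
  gpa         FLOAT,
  stats       VARCHAR(10),
  programming VARCHAR(10),
  admitted    INTEGER
);

/* One INSERT per row of the listing above; the remaining 37 rows follow the same pattern. */
INSERT INTO admissions_train VALUES (1, 'yes', 3.95, 'Beginner', 'Beginner', 0);
INSERT INTO admissions_train VALUES (2, 'yes', 3.76, 'Beginner', 'Beginner', 0);
INSERT INTO admissions_train VALUES (3, 'no', 3.70, 'Novice', 'Beginner', 1);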

SQL Call

This call uses the default options: the intercept is included (Intercept ('true')) and stepwise regression is not used (Step ('false')). The response variable (admitted in this example) must be the first column listed in the InputColumns argument, followed by the predictor columns.

DROP TABLE glm_admissions_model;

SELECT * FROM GLM (
  ON admissions_train AS InputTable
  OUT TABLE OutputTable (glm_admissions_model)
  USING
  InputColumns ('admitted','masters', 'gpa', 'stats', 'programming')
  CategoricalColumns ('masters', 'stats', 'programming')
  Family ('LOGISTIC')
  LinkFunction ('LOGIT')
  WeightColumn ('1')
  StopThreshold (0.01)
  MaxIterNum (25)
  Step ('false')
  Intercept ('true')
) AS dt;

Output

The output table shows the model statistics.

Model Statistics
predictor estimate std_error z_score p_value significance
(Intercept) 1.07751 2.92076 0.368914 0.712192  
masters.no 2.21655 1.01999 2.17311 0.0297719 *
gpa -0.113935 0.802573 -0.141962 0.88711  
stats.Novice 0.0406848 1.11567 0.0364667 0.97091  
stats.Beginner 0.526618 1.2229 0.430631 0.666736  
programming.Beginner -1.76976 1.069 -1.65553 0.0978177 .
programming.Novice -0.98035 1.14004 -0.859923 0.389831  
ITERATIONS # 4 0 0 0 Number of Fisher Scoring iterations
ROWS # 40 0 0 0 Number of rows
Residual deviance 38.9038 0 0 0 on 33 degrees of freedom
Pearson goodness of fit 37.7905 0 0 0 on 33 degrees of freedom
AIC 52.9038 0 0 0 Akaike information criterion
BIC 64.726 0 0 0 Bayesian information criterion
Wald Test 9.89642 0 0 0.19452  
Dispersion parameter 1 0 0 0 Taken to be 1 for BINOMIAL and POISSON.
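
The goodness-of-fit rows can be checked directly from the log likelihood that the model table stores (-19.4519, shown later in this example) and the number of estimated coefficients, k = 7 (the intercept plus six predictor terms). Using the standard definitions, with n = 40 rows:

\text{Residual deviance} = -2\log L = -2(-19.4519) = 38.9038 \text{ on } 40 - 7 = 33 \text{ degrees of freedom}

\text{AIC} = -2\log L + 2k = 38.9038 + 14 = 52.9038

\text{BIC} = -2\log L + k\ln n = 38.9038 + 7\ln 40 \approx 64.726

(For a binary response, the saturated model has log likelihood 0, so the residual deviance reduces to -2 log L.)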

For each categorical variable, the model selects a reference category. In this example, the reference categories are Advanced for the stats and programming variables and yes for the masters variable, which is why those rows carry no estimates in the glm_admissions_model table below.
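
To see how the estimates combine into a prediction, the following query is a minimal sketch (it is not part of the original example and uses the rounded estimates above) that computes the fitted admission probability for applicant 1 (masters = yes, gpa = 3.95, stats = Beginner, programming = Beginner). The reference categories contribute nothing to the linear predictor.

SELECT 1.0 / (1.0 + EXP(-(
    1.07751              /* (Intercept) */
  + 0.0                  /* masters = yes (reference category) */
  + (-0.113935) * 3.95   /* gpa */
  + 0.526618             /* stats = Beginner */
  + (-1.76976)           /* programming = Beginner */
  ))) AS admit_probability;

The result is roughly 0.35, which is consistent with applicant 1 not being admitted (admitted = 0).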

This query returns the following table:

SELECT * FROM glm_admissions_model ORDER BY attribute;
glm_admissions_model
attribute predictor category estimate std_err z_score p_value significance family
-1 Loglik   -19.45190048 40 6 0   LOGISTIC
0 (Intercept)   1.077509999 2.920759916 0.368914008 0.712191999   LOGISTIC
1 masters yes           LOGISTIC
2 masters no 2.216550112 1.019989967 2.173110008 0.0297719 * LOGISTIC
3 gpa   -0.113935001 0.802573025 -0.141962007 0.887109995   LOGISTIC
4 stats advanced           LOGISTIC
5 stats novice 0.040684801 1.115669966 0.036466699 0.970910013   LOGISTIC
6 stats beginner 0.526618004 1.222900033 0.430631012 0.666736007   LOGISTIC
7 programming advanced           LOGISTIC
8 programming beginner -1.769760013 1.069000006 -1.655529976 0.097817697 . LOGISTIC
9 programming novice -0.980350018 1.14004004 -0.859923 0.389831007   LOGISTIC