# Generalized Linear Models (Formula)ΒΆ

This notebook illustrates how you can use R-style formulas to fit Generalized Linear Models.

To begin, we load the Star98 dataset and we construct a formula and pre-process the data:

In [1]:
from __future__ import print_function
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = 'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
dta = star98[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP',
'PCTCHRT', 'PCTYRRND', 'PERMINTE', 'AVYRSEXP', 'AVSALK',
'PERSPENK', 'PTRATIO', 'PCTAF']]
endog = dta['NABOVE'] / (dta['NABOVE'] + dta.pop('NBELOW'))
del dta['NABOVE']
dta['SUCCESS'] = endog


Then, we fit the GLM model:

In [2]:
mod1 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod1.summary()

Out[2]:
Dep. Variable: No. Observations: SUCCESS 303 GLM 282 Binomial 20 logit 1.0 IRLS -189.70 Tue, 02 Dec 2014 380.66 12:53:02 8.48 7
coef std err z P>|z| [95.0% Conf. Int.] 0.4037 25.036 0.016 0.987 -48.665 49.472 -0.0204 0.010 -1.982 0.048 -0.041 -0.000 0.0159 0.017 0.910 0.363 -0.018 0.050 -0.0198 0.020 -1.004 0.316 -0.058 0.019 -0.0096 0.010 -0.951 0.341 -0.029 0.010 -0.0022 0.022 -0.103 0.918 -0.045 0.040 -0.0022 0.006 -0.348 0.728 -0.014 0.010 0.1068 0.787 0.136 0.892 -1.436 1.650 -0.0411 1.176 -0.035 0.972 -2.346 2.264 -0.0031 0.054 -0.057 0.954 -0.108 0.102 0.0131 0.295 0.044 0.965 -0.566 0.592 -0.0019 0.013 -0.145 0.885 -0.028 0.024 0.0008 0.020 0.038 0.970 -0.039 0.041 5.978e-05 0.001 0.068 0.946 -0.002 0.002 -0.3097 4.233 -0.073 0.942 -8.606 7.987 0.0096 0.919 0.010 0.992 -1.792 1.811 0.0066 0.206 0.032 0.974 -0.397 0.410 -0.0143 0.474 -0.030 0.976 -0.944 0.916 0.0105 0.098 0.107 0.915 -0.182 0.203 -0.0001 0.022 -0.005 0.996 -0.044 0.044 -0.0002 0.005 -0.051 0.959 -0.010 0.009

Finally, we define a function to operate customized data transformation using the formula framework:

In [3]:
def double_it(x):
return 2 * x
formula = 'SUCCESS ~ double_it(LOWINC) + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
mod2 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod2.summary()

Out[3]:
Dep. Variable: No. Observations: SUCCESS 303 GLM 282 Binomial 20 logit 1.0 IRLS -189.70 Tue, 02 Dec 2014 380.66 12:53:02 8.48 7
coef std err z P>|z| [95.0% Conf. Int.] 0.4037 25.036 0.016 0.987 -48.665 49.472 -0.0102 0.005 -1.982 0.048 -0.020 -0.000 0.0159 0.017 0.910 0.363 -0.018 0.050 -0.0198 0.020 -1.004 0.316 -0.058 0.019 -0.0096 0.010 -0.951 0.341 -0.029 0.010 -0.0022 0.022 -0.103 0.918 -0.045 0.040 -0.0022 0.006 -0.348 0.728 -0.014 0.010 0.1068 0.787 0.136 0.892 -1.436 1.650 -0.0411 1.176 -0.035 0.972 -2.346 2.264 -0.0031 0.054 -0.057 0.954 -0.108 0.102 0.0131 0.295 0.044 0.965 -0.566 0.592 -0.0019 0.013 -0.145 0.885 -0.028 0.024 0.0008 0.020 0.038 0.970 -0.039 0.041 5.978e-05 0.001 0.068 0.946 -0.002 0.002 -0.3097 4.233 -0.073 0.942 -8.606 7.987 0.0096 0.919 0.010 0.992 -1.792 1.811 0.0066 0.206 0.032 0.974 -0.397 0.410 -0.0143 0.474 -0.030 0.976 -0.944 0.916 0.0105 0.098 0.107 0.915 -0.182 0.203 -0.0001 0.022 -0.005 0.996 -0.044 0.044 -0.0002 0.005 -0.051 0.959 -0.010 0.009

As expected, the coefficient for double_it(LOWINC) in the second model is half the size of the LOWINC coefficient from the first model:

In [4]:
print(mod1.params[1])
print(mod2.params[1] * 2)

-0.0203959871548
-0.0203959871548



#### Previous topic

Generalized Linear Models

#### Next topic

statsmodels.genmod.generalized_linear_model.GLM