# Interactions and ANOVA¶

Note

This script is based heavily on Jonathan Taylor’s class notes http://www.stanford.edu/class/stats191/interactions.html

```In [1]: from urllib2 import urlopen

In [2]: import numpy as np

In [4]: import pandas

In [5]: import matplotlib.pyplot as plt

In [6]: from statsmodels.formula.api import ols

In [7]: from statsmodels.graphics.api import interaction_plot, abline_plot

In [8]: from statsmodels.stats.anova import anova_lm
```

```In [9]: try:
...: except:
...:      print "fetching from website"
...:      url = 'http://stats191.stanford.edu/data/salary.table'
...:      #the next line is not necessary with recent version of pandas
...:      url = urlopen(url)
...:      salary_table.to_csv('salary.table', index=False)
...:

In [10]: E = salary_table.E

In [11]: M = salary_table.M

In [12]: X = salary_table.X

In [13]: S = salary_table.S
```

Take a look at the data

```In [14]: plt.figure(figsize=(6, 6));

In [15]: symbols = ['D', '^']

In [16]: colors = ['r', 'g', 'blue']

In [17]: factor_groups = salary_table.groupby(['E', 'M'])

In [18]: for values, group in factor_groups:
....:      i, j = values
....:      plt.scatter(group['X'], group['S'], marker=symbols[j], color=colors[i - 1],
....:                  s=144)
....:

In [19]: plt.xlabel('Experience');

In [20]: plt.ylabel('Salary');
```

Fit a linear model

```In [21]: formula = 'S ~ C(E) + C(M) + X'

In [22]: lm = ols(formula, salary_table).fit()

In [23]: print lm.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                      S   R-squared:                       0.957
Method:                 Least Squares   F-statistic:                     226.8
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           2.23e-27
Time:                        17:19:09   Log-Likelihood:                -381.63
No. Observations:                  46   AIC:                             773.3
Df Residuals:                      41   BIC:                             782.4
Df Model:                           4
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   8035.5976    386.689     20.781      0.000      7254.663  8816.532
C(E)[T.2]   3144.0352    361.968      8.686      0.000      2413.025  3875.045
C(E)[T.3]   2996.2103    411.753      7.277      0.000      2164.659  3827.762
C(M)[T.1]   6883.5310    313.919     21.928      0.000      6249.559  7517.503
X            546.1840     30.519     17.896      0.000       484.549   607.819
==============================================================================
Omnibus:                        2.293   Durbin-Watson:                   2.237
Prob(Omnibus):                  0.318   Jarque-Bera (JB):                1.362
Skew:                          -0.077   Prob(JB):                        0.506
Kurtosis:                       2.171   Cond. No.                         33.5
==============================================================================
```

Have a look at the created design matrix

```In [24]: lm.model.exog[:20]
Out[24]:
array([[ 1.,  0.,  0.,  1.,  1.],
[ 1.,  0.,  1.,  0.,  1.],
[ 1.,  0.,  1.,  1.,  1.],
[ 1.,  1.,  0.,  0.,  1.],
[ 1.,  0.,  1.,  0.,  1.],
[ 1.,  1.,  0.,  1.,  2.],
[ 1.,  1.,  0.,  0.,  2.],
[ 1.,  0.,  0.,  0.,  2.],
[ 1.,  0.,  1.,  0.,  2.],
[ 1.,  1.,  0.,  0.,  3.],
[ 1.,  0.,  0.,  1.,  3.],
[ 1.,  1.,  0.,  1.,  3.],
[ 1.,  0.,  1.,  1.,  3.],
[ 1.,  0.,  0.,  0.,  4.],
[ 1.,  0.,  1.,  1.,  4.],
[ 1.,  0.,  1.,  0.,  4.],
[ 1.,  1.,  0.,  0.,  4.],
[ 1.,  1.,  0.,  0.,  5.],
[ 1.,  0.,  1.,  0.,  5.],
[ 1.,  0.,  0.,  1.,  5.]])
```

Or since we initially passed in a DataFrame, we have a DataFrame available in

```In [25]: lm.model.data.orig_exog
Out[25]:
Intercept  C(E)[T.2]  C(E)[T.3]  C(M)[T.1]   X
0           1          0          0          1   1
1           1          0          1          0   1
2           1          0          1          1   1
3           1          1          0          0   1
4           1          0          1          0   1
5           1          1          0          1   2
6           1          1          0          0   2
7           1          0          0          0   2
8           1          0          1          0   2
9           1          1          0          0   3
10          1          0          0          1   3
11          1          1          0          1   3
12          1          0          1          1   3
13          1          0          0          0   4
14          1          0          1          1   4
15          1          0          1          0   4
16          1          1          0          0   4
17          1          1          0          0   5
18          1          0          1          0   5
19          1          0          0          1   5
20          1          0          0          0   6
21          1          0          1          1   6
22          1          1          0          0   6
23          1          1          0          1   6
24          1          0          0          1   7
25          1          1          0          0   8
26          1          0          0          1   8
27          1          0          1          1   8
28          1          0          0          0   8
29          1          0          0          0  10
30          1          1          0          0  10
31          1          0          1          1  10
32          1          1          0          1  10
33          1          1          0          1  11
34          1          0          0          0  11
35          1          1          0          0  12
36          1          0          1          1  12
37          1          0          0          0  13
38          1          1          0          1  13
39          1          1          0          0  14
40          1          0          1          1  15
41          1          1          0          1  16
42          1          1          0          0  16
43          1          0          0          0  16
44          1          1          0          0  17
45          1          0          0          0  20
```

We keep a reference to the original untouched data in

```In [26]: lm.model.data.frame
Out[26]:
S   X  E  M
0   13876   1  1  1
1   11608   1  3  0
2   18701   1  3  1
3   11283   1  2  0
4   11767   1  3  0
5   20872   2  2  1
6   11772   2  2  0
7   10535   2  1  0
8   12195   2  3  0
9   12313   3  2  0
10  14975   3  1  1
11  21371   3  2  1
12  19800   3  3  1
13  11417   4  1  0
14  20263   4  3  1
15  13231   4  3  0
16  12884   4  2  0
17  13245   5  2  0
18  13677   5  3  0
19  15965   5  1  1
20  12336   6  1  0
21  21352   6  3  1
22  13839   6  2  0
23  22884   6  2  1
24  16978   7  1  1
25  14803   8  2  0
26  17404   8  1  1
27  22184   8  3  1
28  13548   8  1  0
29  14467  10  1  0
30  15942  10  2  0
31  23174  10  3  1
32  23780  10  2  1
33  25410  11  2  1
34  14861  11  1  0
35  16882  12  2  0
36  24170  12  3  1
37  15990  13  1  0
38  26330  13  2  1
39  17949  14  2  0
40  25685  15  3  1
41  27837  16  2  1
42  18838  16  2  0
43  17483  16  1  0
44  19207  17  2  0
45  19346  20  1  0
```

Get influence statistics

```In [27]: infl = lm.get_influence()

In [28]: print infl.summary_table()
==================================================================================================
obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits
value          d   residual              internal   residual
--------------------------------------------------------------------------------------------------
0  13876.000  15465.313      0.104     -1.683      0.155     -0.722     -1.723     -0.739
1  11608.000  11577.992      0.000      0.031      0.130      0.012      0.031      0.012
2  18701.000  18461.523      0.001      0.247      0.109      0.086      0.244      0.085
3  11283.000  11725.817      0.005     -0.458      0.113     -0.163     -0.453     -0.162
4  11767.000  11577.992      0.001      0.197      0.130      0.076      0.195      0.075
5  20872.000  19155.532      0.092      1.787      0.126      0.678      1.838      0.698
6  11772.000  12272.001      0.006     -0.513      0.101     -0.172     -0.509     -0.170
7  10535.000   9127.966      0.056      1.457      0.116      0.529      1.478      0.537
8  12195.000  12124.176      0.000      0.074      0.123      0.028      0.073      0.027
9  12313.000  12818.185      0.005     -0.516      0.091     -0.163     -0.511     -0.161
10  14975.000  16557.681      0.084     -1.655      0.134     -0.650     -1.692     -0.664
11  21371.000  19701.716      0.078      1.728      0.116      0.624      1.772      0.640
12  19800.000  19553.891      0.001      0.252      0.096      0.082      0.249      0.081
13  11417.000  10220.334      0.033      1.227      0.098      0.405      1.234      0.408
14  20263.000  20100.075      0.001      0.166      0.093      0.053      0.165      0.053
15  13231.000  13216.544      0.000      0.015      0.114      0.005      0.015      0.005
16  12884.000  13364.369      0.004     -0.488      0.082     -0.146     -0.483     -0.145
17  13245.000  13910.553      0.007     -0.674      0.075     -0.192     -0.669     -0.191
18  13677.000  13762.728      0.000     -0.089      0.113     -0.032     -0.087     -0.031
19  15965.000  17650.049      0.082     -1.747      0.119     -0.642     -1.794     -0.659
20  12336.000  11312.702      0.021      1.043      0.087      0.323      1.044      0.323
21  21352.000  21192.443      0.001      0.163      0.091      0.052      0.161      0.051
22  13839.000  14456.737      0.006     -0.624      0.070     -0.171     -0.619     -0.170
23  22884.000  21340.268      0.052      1.579      0.095      0.511      1.610      0.521
24  16978.000  18742.417      0.083     -1.822      0.111     -0.644     -1.877     -0.664
25  14803.000  15549.105      0.008     -0.751      0.065     -0.199     -0.747     -0.198
26  17404.000  19288.601      0.093     -1.944      0.110     -0.684     -2.016     -0.709
27  22184.000  22284.811      0.000     -0.103      0.096     -0.034     -0.102     -0.033
28  13548.000  12405.070      0.025      1.162      0.083      0.350      1.167      0.352
29  14467.000  13497.438      0.018      0.987      0.086      0.304      0.987      0.304
30  15942.000  16641.473      0.007     -0.705      0.068     -0.190     -0.701     -0.189
31  23174.000  23377.179      0.001     -0.209      0.108     -0.073     -0.207     -0.072
32  23780.000  23525.004      0.001      0.260      0.092      0.083      0.257      0.082
33  25410.000  24071.188      0.040      1.370      0.096      0.446      1.386      0.451
34  14861.000  14043.622      0.014      0.834      0.091      0.263      0.831      0.262
35  16882.000  17733.841      0.012     -0.863      0.077     -0.249     -0.860     -0.249
36  24170.000  24469.547      0.003     -0.312      0.127     -0.119     -0.309     -0.118
37  15990.000  15135.990      0.018      0.878      0.104      0.300      0.876      0.299
38  26330.000  25163.556      0.035      1.202      0.109      0.420      1.209      0.422
39  17949.000  18826.209      0.017     -0.897      0.093     -0.288     -0.895     -0.287
40  25685.000  26108.099      0.008     -0.452      0.169     -0.204     -0.447     -0.202
41  27837.000  26802.108      0.039      1.087      0.141      0.440      1.089      0.441
42  18838.000  19918.577      0.033     -1.119      0.117     -0.407     -1.123     -0.408
43  17483.000  16774.542      0.018      0.743      0.138      0.297      0.739      0.295
44  19207.000  20464.761      0.052     -1.313      0.131     -0.511     -1.325     -0.515
45  19346.000  18959.278      0.009      0.423      0.208      0.216      0.419      0.214
==================================================================================================
```

or get a dataframe

```In [29]: df_infl = infl.summary_frame()
```

Now plot the reiduals within the groups separately

```In [30]: resid = lm.resid

In [31]: plt.figure(figsize=(6, 6));

In [32]: for values, group in factor_groups:
....:      i, j = values
....:      group_num = i * 2 + j - 1  # for plotting purposes
....:      x = [group_num] * len(group)
....:      plt.scatter(x, resid[group.index], marker=symbols[j], color=colors[i - 1],
....:                  s=144, edgecolors='black')
....:

In [33]: plt.xlabel('Group');

In [34]: plt.ylabel('Residuals');
```

now we will test some interactions using anova or f_test

```In [35]: interX_lm = ols("S ~ C(E) * X + C(M)", salary_table).fit()

In [36]: print interX_lm.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                      S   R-squared:                       0.961
Method:                 Least Squares   F-statistic:                     158.6
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           8.23e-26
Time:                        17:19:13   Log-Likelihood:                -379.47
No. Observations:                  46   AIC:                             772.9
Df Residuals:                      39   BIC:                             785.7
Df Model:                           6
===============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept    7256.2800    549.494     13.205      0.000      6144.824  8367.736
C(E)[T.2]    4172.5045    674.966      6.182      0.000      2807.256  5537.753
C(E)[T.3]    3946.3649    686.693      5.747      0.000      2557.396  5335.333
C(M)[T.1]    7102.4539    333.442     21.300      0.000      6428.005  7776.903
X             632.2878     53.185     11.888      0.000       524.710   739.865
C(E)[T.2]:X  -125.5147     69.863     -1.797      0.080      -266.826    15.796
C(E)[T.3]:X  -141.2741     89.281     -1.582      0.122      -321.861    39.313
==============================================================================
Omnibus:                        0.432   Durbin-Watson:                   2.179
Prob(Omnibus):                  0.806   Jarque-Bera (JB):                0.590
Skew:                           0.144   Prob(JB):                        0.744
Kurtosis:                       2.526   Cond. No.                         69.7
==============================================================================
```

Do an ANOVA check

```In [37]: table1 = anova_lm(lm, interX_lm)

In [38]: print table1
df_resid              ssr  df_diff         ss_diff         F    Pr(>F)
0        41  43280719.492876        0             NaN       NaN       NaN
1        39  39410679.807560        2  3870039.685316  1.914856  0.160964

In [39]: interM_lm = ols("S ~ X + C(E)*C(M)", data=salary_table).fit()

In [40]: print interM_lm.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                      S   R-squared:                       0.999
Method:                 Least Squares   F-statistic:                     5517.
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           1.67e-55
Time:                        17:19:14   Log-Likelihood:                -298.74
No. Observations:                  46   AIC:                             611.5
Df Residuals:                      39   BIC:                             624.3
Df Model:                           6
=======================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept            9472.6854     80.344    117.902      0.000      9310.175  9635.196
C(E)[T.2]            1381.6706     77.319     17.870      0.000      1225.279  1538.063
C(E)[T.3]            1730.7483    105.334     16.431      0.000      1517.690  1943.806
C(M)[T.1]            3981.3769    101.175     39.351      0.000      3776.732  4186.022
C(E)[T.2]:C(M)[T.1]  4902.5231    131.359     37.322      0.000      4636.825  5168.222
C(E)[T.3]:C(M)[T.1]  3066.0351    149.330     20.532      0.000      2763.986  3368.084
X                     496.9870      5.566     89.283      0.000       485.728   508.246
==============================================================================
Omnibus:                       74.761   Durbin-Watson:                   2.244
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1037.873
Skew:                          -4.103   Prob(JB):                    4.25e-226
Kurtosis:                      24.776   Cond. No.                         79.0
==============================================================================

In [41]: table2 = anova_lm(lm, interM_lm)

In [42]: print table2
df_resid              ssr  df_diff          ss_diff           F        Pr(>F)
0        41  43280719.492876        0              NaN         NaN           NaN
1        39   1178167.864864        2  42102551.628012  696.844466  3.025504e-31
```

The design matrix as a DataFrame

```In [43]: interM_lm.model.data.orig_exog
Out[43]:
Intercept  C(E)[T.2]  C(E)[T.3]  C(M)[T.1]  C(E)[T.2]:C(M)[T.1]  \
0           1          0          0          1                    0
1           1          0          1          0                    0
2           1          0          1          1                    0
3           1          1          0          0                    0
4           1          0          1          0                    0
5           1          1          0          1                    1
6           1          1          0          0                    0
7           1          0          0          0                    0
8           1          0          1          0                    0
9           1          1          0          0                    0
10          1          0          0          1                    0
11          1          1          0          1                    1
12          1          0          1          1                    0
13          1          0          0          0                    0
14          1          0          1          1                    0
15          1          0          1          0                    0
16          1          1          0          0                    0
17          1          1          0          0                    0
18          1          0          1          0                    0
19          1          0          0          1                    0
20          1          0          0          0                    0
21          1          0          1          1                    0
22          1          1          0          0                    0
23          1          1          0          1                    1
24          1          0          0          1                    0
25          1          1          0          0                    0
26          1          0          0          1                    0
27          1          0          1          1                    0
28          1          0          0          0                    0
29          1          0          0          0                    0
30          1          1          0          0                    0
31          1          0          1          1                    0
32          1          1          0          1                    1
33          1          1          0          1                    1
34          1          0          0          0                    0
35          1          1          0          0                    0
36          1          0          1          1                    0
37          1          0          0          0                    0
38          1          1          0          1                    1
39          1          1          0          0                    0
40          1          0          1          1                    0
41          1          1          0          1                    1
42          1          1          0          0                    0
43          1          0          0          0                    0
44          1          1          0          0                    0
45          1          0          0          0                    0

C(E)[T.3]:C(M)[T.1]   X
0                     0   1
1                     0   1
2                     1   1
3                     0   1
4                     0   1
5                     0   2
6                     0   2
7                     0   2
8                     0   2
9                     0   3
10                    0   3
11                    0   3
12                    1   3
13                    0   4
14                    1   4
15                    0   4
16                    0   4
17                    0   5
18                    0   5
19                    0   5
20                    0   6
21                    1   6
22                    0   6
23                    0   6
24                    0   7
25                    0   8
26                    0   8
27                    1   8
28                    0   8
29                    0  10
30                    0  10
31                    1  10
32                    0  10
33                    0  11
34                    0  11
35                    0  12
36                    1  12
37                    0  13
38                    0  13
39                    0  14
40                    1  15
41                    0  16
42                    0  16
43                    0  16
44                    0  17
45                    0  20
```

The design matrix as an ndarray

```In [44]: interM_lm.model.exog
Out[44]:
array([[  1.,   0.,   0.,   1.,   0.,   0.,   1.],
[  1.,   0.,   1.,   0.,   0.,   0.,   1.],
[  1.,   0.,   1.,   1.,   0.,   1.,   1.],
[  1.,   1.,   0.,   0.,   0.,   0.,   1.],
[  1.,   0.,   1.,   0.,   0.,   0.,   1.],
[  1.,   1.,   0.,   1.,   1.,   0.,   2.],
[  1.,   1.,   0.,   0.,   0.,   0.,   2.],
[  1.,   0.,   0.,   0.,   0.,   0.,   2.],
[  1.,   0.,   1.,   0.,   0.,   0.,   2.],
[  1.,   1.,   0.,   0.,   0.,   0.,   3.],
[  1.,   0.,   0.,   1.,   0.,   0.,   3.],
[  1.,   1.,   0.,   1.,   1.,   0.,   3.],
[  1.,   0.,   1.,   1.,   0.,   1.,   3.],
[  1.,   0.,   0.,   0.,   0.,   0.,   4.],
[  1.,   0.,   1.,   1.,   0.,   1.,   4.],
[  1.,   0.,   1.,   0.,   0.,   0.,   4.],
[  1.,   1.,   0.,   0.,   0.,   0.,   4.],
[  1.,   1.,   0.,   0.,   0.,   0.,   5.],
[  1.,   0.,   1.,   0.,   0.,   0.,   5.],
[  1.,   0.,   0.,   1.,   0.,   0.,   5.],
[  1.,   0.,   0.,   0.,   0.,   0.,   6.],
[  1.,   0.,   1.,   1.,   0.,   1.,   6.],
[  1.,   1.,   0.,   0.,   0.,   0.,   6.],
[  1.,   1.,   0.,   1.,   1.,   0.,   6.],
[  1.,   0.,   0.,   1.,   0.,   0.,   7.],
[  1.,   1.,   0.,   0.,   0.,   0.,   8.],
[  1.,   0.,   0.,   1.,   0.,   0.,   8.],
[  1.,   0.,   1.,   1.,   0.,   1.,   8.],
[  1.,   0.,   0.,   0.,   0.,   0.,   8.],
[  1.,   0.,   0.,   0.,   0.,   0.,  10.],
[  1.,   1.,   0.,   0.,   0.,   0.,  10.],
[  1.,   0.,   1.,   1.,   0.,   1.,  10.],
[  1.,   1.,   0.,   1.,   1.,   0.,  10.],
[  1.,   1.,   0.,   1.,   1.,   0.,  11.],
[  1.,   0.,   0.,   0.,   0.,   0.,  11.],
[  1.,   1.,   0.,   0.,   0.,   0.,  12.],
[  1.,   0.,   1.,   1.,   0.,   1.,  12.],
[  1.,   0.,   0.,   0.,   0.,   0.,  13.],
[  1.,   1.,   0.,   1.,   1.,   0.,  13.],
[  1.,   1.,   0.,   0.,   0.,   0.,  14.],
[  1.,   0.,   1.,   1.,   0.,   1.,  15.],
[  1.,   1.,   0.,   1.,   1.,   0.,  16.],
[  1.,   1.,   0.,   0.,   0.,   0.,  16.],
[  1.,   0.,   0.,   0.,   0.,   0.,  16.],
[  1.,   1.,   0.,   0.,   0.,   0.,  17.],
[  1.,   0.,   0.,   0.,   0.,   0.,  20.]])

In [45]: interM_lm.model.exog_names
Out[45]:
['Intercept',
'C(E)[T.2]',
'C(E)[T.3]',
'C(M)[T.1]',
'C(E)[T.2]:C(M)[T.1]',
'C(E)[T.3]:C(M)[T.1]',
'X']

In [46]: infl = interM_lm.get_influence()

In [47]: resid = infl.resid_studentized_internal

In [48]: plt.figure(figsize=(6, 6));

In [49]: for values, group in factor_groups:
....:      i, j = values
....:      idx = group.index
....:      plt.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i - 1],
....:              s=144, edgecolors='black')
....:

In [50]: plt.xlabel('X');

In [51]: plt.ylabel('standardized resids');
```

Looks like one observation is an outlier. TODO: do we have Bonferonni outlier test?

```In [52]: drop_idx = abs(resid).argmax()

In [53]: print drop_idx  # zero-based index
32

In [54]: idx = salary_table.index.drop([drop_idx])

In [55]: lm32 = ols('S ~ C(E) + X + C(M)', data=salary_table, subset=idx).fit()

In [56]: print lm32.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                      S   R-squared:                       0.955
Method:                 Least Squares   F-statistic:                     211.7
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           2.45e-26
Time:                        17:19:17   Log-Likelihood:                -373.79
No. Observations:                  45   AIC:                             757.6
Df Residuals:                      40   BIC:                             766.6
Df Model:                           4
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   8044.7518    392.781     20.482      0.000      7250.911  8838.592
C(E)[T.2]   3129.5286    370.470      8.447      0.000      2380.780  3878.277
C(E)[T.3]   2999.4451    416.712      7.198      0.000      2157.238  3841.652
C(M)[T.1]   6866.9856    323.991     21.195      0.000      6212.175  7521.796
X            545.7855     30.912     17.656      0.000       483.311   608.260
==============================================================================
Omnibus:                        2.511   Durbin-Watson:                   2.265
Prob(Omnibus):                  0.285   Jarque-Bera (JB):                1.400
Skew:                          -0.044   Prob(JB):                        0.496
Kurtosis:                       2.140   Cond. No.                         33.1
==============================================================================

In [57]: interX_lm32 = ols('S ~ C(E) * X + C(M)', data=salary_table, subset=idx).fit()

In [58]: print interX_lm32.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                      S   R-squared:                       0.959
Method:                 Least Squares   F-statistic:                     147.7
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           8.97e-25
Time:                        17:19:17   Log-Likelihood:                -371.70
No. Observations:                  45   AIC:                             757.4
Df Residuals:                      38   BIC:                             770.0
Df Model:                           6
===============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept    7266.0887    558.872     13.001      0.000      6134.711  8397.466
C(E)[T.2]    4162.0846    685.728      6.070      0.000      2773.900  5550.269
C(E)[T.3]    3940.4359    696.067      5.661      0.000      2531.322  5349.549
C(M)[T.1]    7088.6387    345.587     20.512      0.000      6389.035  7788.243
X             631.6892     53.950     11.709      0.000       522.473   740.905
C(E)[T.2]:X  -125.5009     70.744     -1.774      0.084      -268.714    17.712
C(E)[T.3]:X  -139.8410     90.728     -1.541      0.132      -323.511    43.829
==============================================================================
Omnibus:                        0.617   Durbin-Watson:                   2.194
Prob(Omnibus):                  0.734   Jarque-Bera (JB):                0.728
Skew:                           0.162   Prob(JB):                        0.695
Kurtosis:                       2.468   Cond. No.                         68.7
==============================================================================

In [59]: table3 = anova_lm(lm32, interX_lm32)

In [60]: print table3
df_resid              ssr  df_diff         ss_diff         F    Pr(>F)
0        40  43209096.482552        0             NaN       NaN       NaN
1        38  39374237.269069        2  3834859.213482  1.850508  0.171042

In [61]: interM_lm32 = ols('S ~ X + C(E) * C(M)', data=salary_table, subset=idx).fit()

In [62]: table4 = anova_lm(lm32, interM_lm32)

In [63]: print table4
df_resid              ssr  df_diff          ss_diff            F        Pr(>F)
0        40  43209096.482552        0              NaN          NaN           NaN
1        38    171188.119937        2  43037908.362615  4776.734853  2.291239e-46
```

Replot the residuals

```In [64]: try:
....:      resid = interM_lm32.get_influence().summary_frame()['standard_resid']
....: except:
....:      resid = interM_lm32.get_influence().summary_frame()['standard_resid']
....:

In [65]: plt.figure(figsize=(6, 6));

In [66]: for values, group in factor_groups:
....:      i, j = values
....:      idx = group.index
....:      plt.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i - 1],
....:              s=144, edgecolors='black')
....:

In [67]: plt.xlabel('X[~[32]]');

In [68]: plt.ylabel('standardized resids');
```

Plot the fitted values

```In [69]: lm_final = ols('S ~ X + C(E)*C(M)',
....:                 data=salary_table.drop([drop_idx])).fit()
....:

In [70]: mf = lm_final.model.data.orig_exog

In [71]: lstyle = ['-', '--']

In [72]: plt.figure(figsize=(6, 6));

In [73]: for values, group in factor_groups:
....:      i, j = values
....:      idx = group.index
....:      plt.scatter(X[idx], S[idx], marker=symbols[j], color=colors[i - 1],
....:              s=144, edgecolors='black')
....:      # drop NA because there is no idx 32 in the final model
....:      plt.plot(mf.X[idx].dropna(), lm_final.fittedvalues[idx].dropna(),
....:              ls=lstyle[j], color=colors[i - 1])
....:

In [74]: plt.xlabel('Experience');

In [75]: plt.ylabel('Salary');
```

From our first look at the data, the difference between Master’s and PhD in the management group is different than in the non-management group. This is an interaction between the two qualitative variables management, M and education,E. We can visualize this by first removing the effect of experience, then plotting the means within each of the 6 groups using interaction.plot.

```In [76]: U = S - X * interX_lm32.params['X']

In [77]: plt.figure(figsize=(6, 6));

In [78]: interaction_plot(E, M, U, colors=['red', 'blue'], markers=['^', 'D'],
....:          markersize=10, ax=plt.gca())
....:
Out[78]: <matplotlib.figure.Figure at 0x857cc90>
```

## Minority Employment Data¶

```In [79]: try:
....: except:  # don't have data already
....:      url = 'http://stats191.stanford.edu/data/minority.table'
....:      #the next line is not necessary with recent version of pandas
....:      url = urlopen(url)
....:

In [80]: factor_group = minority_table.groupby(['ETHN'])

In [81]: plt.figure(figsize=(6, 6));

In [82]: colors = ['purple', 'green']

In [83]: markers = ['o', 'v']

In [84]: for factor, group in factor_group:
....:      plt.scatter(group['TEST'], group['JPERF'], color=colors[factor],
....:                  marker=markers[factor], s=12**2)
....:

In [85]: plt.xlabel('TEST');

In [86]: plt.ylabel('JPERF');

In [87]: min_lm = ols('JPERF ~ TEST', data=minority_table).fit()

In [88]: print min_lm.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.517
Method:                 Least Squares   F-statistic:                     19.25
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           0.000356
Time:                        17:19:25   Log-Likelihood:                -36.614
No. Observations:                  20   AIC:                             77.23
Df Residuals:                      18   BIC:                             79.22
Df Model:                           1
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      1.0350      0.868      1.192      0.249        -0.789     2.859
TEST           2.3605      0.538      4.387      0.000         1.230     3.491
==============================================================================
Omnibus:                        0.324   Durbin-Watson:                   2.896
Prob(Omnibus):                  0.850   Jarque-Bera (JB):                0.483
Skew:                          -0.186   Prob(JB):                        0.785
Kurtosis:                       2.336   Cond. No.                         5.26
==============================================================================

In [89]: plt.figure(figsize=(6, 6));

In [90]: for factor, group in factor_group:
....:      plt.scatter(group['TEST'], group['JPERF'], color=colors[factor],
....:                  marker=markers[factor], s=12**2)
....:

In [91]: plt.xlabel('TEST')
Out[91]: <matplotlib.text.Text at 0x8740e10>

In [92]: plt.ylabel('JPERF')
Out[92]: <matplotlib.text.Text at 0x8725f10>

In [93]: abline_plot(model_results=min_lm, ax=plt.gca())
Out[93]: <matplotlib.figure.Figure at 0x87409d0>

In [94]: min_lm2 = ols('JPERF ~ TEST + TEST:ETHN',
....:          data=minority_table).fit()
....:

In [95]: print min_lm2.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.632
Method:                 Least Squares   F-statistic:                     14.59
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           0.000204
Time:                        17:19:26   Log-Likelihood:                -33.891
No. Observations:                  20   AIC:                             73.78
Df Residuals:                      17   BIC:                             76.77
Df Model:                           2
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      1.1211      0.780      1.437      0.169        -0.525     2.768
TEST           1.8276      0.536      3.412      0.003         0.698     2.958
TEST:ETHN      0.9161      0.397      2.306      0.034         0.078     1.754
==============================================================================
Omnibus:                        0.388   Durbin-Watson:                   3.008
Prob(Omnibus):                  0.823   Jarque-Bera (JB):                0.514
Skew:                           0.050   Prob(JB):                        0.773
Kurtosis:                       2.221   Cond. No.                         5.96
==============================================================================

In [96]: plt.figure(figsize=(6, 6));

In [97]: for factor, group in factor_group:
....:      plt.scatter(group['TEST'], group['JPERF'], color=colors[factor],
....:                  marker=markers[factor], s=12**2)
....:

In [98]: abline_plot(intercept=min_lm2.params['Intercept'],
....:              slope=min_lm2.params['TEST'], ax=plt.gca(), color='purple')
....:
Out[98]: <matplotlib.figure.Figure at 0x87195d0>

In [99]: abline_plot(intercept=min_lm2.params['Intercept'],
....:              slope=min_lm2.params['TEST'] + min_lm2.params['TEST:ETHN'],
....:              ax=plt.gca(), color='green')
....:
Out[99]: <matplotlib.figure.Figure at 0x87195d0>

In [100]: min_lm3 = ols('JPERF ~ TEST + ETHN', data=minority_table).fit()

In [101]: print min_lm3.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.572
Method:                 Least Squares   F-statistic:                     11.38
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           0.000731
Time:                        17:19:27   Log-Likelihood:                -35.390
No. Observations:                  20   AIC:                             76.78
Df Residuals:                      17   BIC:                             79.77
Df Model:                           2
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.6120      0.887      0.690      0.500        -1.260     2.483
TEST           2.2988      0.522      4.400      0.000         1.197     3.401
ETHN           1.0276      0.691      1.487      0.155        -0.430     2.485
==============================================================================
Omnibus:                        0.251   Durbin-Watson:                   3.028
Prob(Omnibus):                  0.882   Jarque-Bera (JB):                0.437
Skew:                          -0.059   Prob(JB):                        0.804
Kurtosis:                       2.286   Cond. No.                         5.72
==============================================================================

In [102]: plt.figure(figsize=(6, 6));

In [103]: for factor, group in factor_group:
.....:      plt.scatter(group['TEST'], group['JPERF'], color=colors[factor],
.....:                  marker=markers[factor], s=12**2)
.....:

In [104]: abline_plot(intercept=min_lm3.params['Intercept'],
.....:              slope=min_lm3.params['TEST'], ax=plt.gca(), color='purple')
.....:
Out[104]: <matplotlib.figure.Figure at 0x9230f50>

In [105]: abline_plot(intercept=min_lm3.params['Intercept'] + min_lm3.params['ETHN'],
.....:              slope=min_lm3.params['TEST'], ax=plt.gca(), color='green')
.....:
Out[105]: <matplotlib.figure.Figure at 0x9230f50>

In [106]: min_lm4 = ols('JPERF ~ TEST * ETHN', data=minority_table).fit()

In [107]: print min_lm4.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.664
Method:                 Least Squares   F-statistic:                     10.55
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           0.000451
Time:                        17:19:28   Log-Likelihood:                -32.971
No. Observations:                  20   AIC:                             73.94
Df Residuals:                      16   BIC:                             77.92
Df Model:                           3
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.0103      1.050      1.914      0.074        -0.216     4.236
TEST           1.3134      0.670      1.959      0.068        -0.108     2.735
ETHN          -1.9132      1.540     -1.242      0.232        -5.179     1.352
TEST:ETHN      1.9975      0.954      2.093      0.053        -0.026     4.021
==============================================================================
Omnibus:                        3.377   Durbin-Watson:                   3.015
Prob(Omnibus):                  0.185   Jarque-Bera (JB):                1.330
Skew:                           0.120   Prob(JB):                        0.514
Kurtosis:                       1.760   Cond. No.                         13.8
==============================================================================

In [108]: plt.figure(figsize=(6, 6));

In [109]: for factor, group in factor_group:
.....:      plt.scatter(group['TEST'], group['JPERF'], color=colors[factor],
.....:                  marker=markers[factor], s=12**2)
.....:

In [110]: abline_plot(intercept=min_lm4.params['Intercept'],
.....:              slope=min_lm4.params['TEST'], ax=plt.gca(), color='purple')
.....:
Out[110]: <matplotlib.figure.Figure at 0x92e7dd0>

In [111]: abline_plot(intercept=min_lm4.params['Intercept'] + min_lm4.params['ETHN'],
.....:          slope=min_lm4.params['TEST'] + min_lm4.params['TEST:ETHN'],
.....:          ax=plt.gca(), color='green')
.....:
Out[111]: <matplotlib.figure.Figure at 0x92e7dd0>
```

is there any effect of ETHN on slope or intercept

```In [112]: table5 = anova_lm(min_lm, min_lm4)

In [113]: print table5
df_resid        ssr  df_diff    ss_diff         F    Pr(>F)
0        18  45.568297        0        NaN       NaN       NaN
1        16  31.655473        2  13.912824  3.516061  0.054236
```

is there any effect of ETHN on intercept

```In [114]: table6 = anova_lm(min_lm, min_lm3)

In [115]: print table6
df_resid        ssr  df_diff   ss_diff         F    Pr(>F)
0        18  45.568297        0       NaN       NaN       NaN
1        17  40.321546        1  5.246751  2.212087  0.155246
```

is there any effect of ETHN on slope

```In [116]: table7 = anova_lm(min_lm, min_lm2)

In [117]: print table7
df_resid        ssr  df_diff    ss_diff         F    Pr(>F)
0        18  45.568297        0        NaN       NaN       NaN
1        17  34.707653        1  10.860644  5.319603  0.033949
```

is it just the slope or both?

```In [118]: table8 = anova_lm(min_lm2, min_lm4)

In [119]: print table8
df_resid        ssr  df_diff  ss_diff         F    Pr(>F)
0        17  34.707653        0      NaN       NaN       NaN
1        16  31.655473        1  3.05218  1.542699  0.232115
```

## One-way ANOVA¶

```In [120]: try:
.....: except:
.....:      url = 'http://stats191.stanford.edu/data/rehab.csv'
.....:      #the next line is not necessary with recent version of pandas
.....:      url = urlopen(url)
.....:      rehab_table.to_csv('rehab.table')
.....:

In [121]: plt.figure(figsize=(6, 6));

In [122]: rehab_table.boxplot('Time', 'Fitness', ax=plt.gca())
Out[122]: <matplotlib.axes.AxesSubplot at 0x855cdd0>

In [123]: rehab_lm = ols('Time ~ C(Fitness)', data=rehab_table).fit()

In [124]: table9 = anova_lm(rehab_lm)

In [125]: print table9
df  sum_sq     mean_sq          F    PR(>F)
C(Fitness)   2     672  336.000000  16.961538  0.000041
Residual    21     416   19.809524        NaN       NaN

In [126]: print rehab_lm.model.data.orig_exog
Intercept  C(Fitness)[T.2]  C(Fitness)[T.3]
0           1                0                0
1           1                0                0
2           1                0                0
3           1                0                0
4           1                0                0
5           1                0                0
6           1                0                0
7           1                0                0
8           1                1                0
9           1                1                0
10          1                1                0
11          1                1                0
12          1                1                0
13          1                1                0
14          1                1                0
15          1                1                0
16          1                1                0
17          1                1                0
18          1                0                1
19          1                0                1
20          1                0                1
21          1                0                1
22          1                0                1
23          1                0                1

In [127]: print rehab_lm.summary()
OLS Regression Results
==============================================================================
Dep. Variable:                   Time   R-squared:                       0.618
Method:                 Least Squares   F-statistic:                     16.96
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           4.13e-05
Time:                        17:19:32   Log-Likelihood:                -68.286
No. Observations:                  24   AIC:                             142.6
Df Residuals:                      21   BIC:                             146.1
Df Model:                           2
===================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          38.0000      1.574     24.149      0.000        34.728    41.272
C(Fitness)[T.2]    -6.0000      2.111     -2.842      0.010       -10.390    -1.610
C(Fitness)[T.3]   -14.0000      2.404     -5.824      0.000       -18.999    -9.001
==============================================================================
Omnibus:                        0.163   Durbin-Watson:                   2.209
Prob(Omnibus):                  0.922   Jarque-Bera (JB):                0.211
Skew:                          -0.163   Prob(JB):                        0.900
Kurtosis:                       2.675   Cond. No.                         3.80
==============================================================================
```

## Two-way ANOVA¶

```In [128]: try:
.....: except:
.....:      url = 'http://stats191.stanford.edu/data/kidney.table'
.....:      #the next line is not necessary with recent version of pandas
.....:      url = urlopen(url)
.....:      kidney_table = pandas.read_table(url, delimiter=" *")
.....:
```

Explore the dataset

```In [129]: print kidney_table.groupby(['Weight', 'Duration']).size()
Weight  Duration
1       1           10
2           10
2       1           10
2           10
3       1           10
2           10
dtype: int64
```

balanced panel

```In [130]: kt = kidney_table

In [131]: plt.figure(figsize=(6, 6));

In [132]: interaction_plot(kt['Weight'], kt['Duration'], np.log(kt['Days'] + 1),
.....:                   colors=['red', 'blue'], markers=['D', '^'], ms=10,
.....:                   ax=plt.gca())
.....:
Out[132]: <matplotlib.figure.Figure at 0x92fee90>
```

You have things available in the calling namespace available in the formula evaluation namespace

```In [133]: kidney_lm = ols('np.log(Days+1) ~ C(Duration) * C(Weight)', data=kt).fit()

In [134]: table10 = anova_lm(kidney_lm)

In [135]: print anova_lm(ols('np.log(Days+1) ~ C(Duration) + C(Weight)',
.....:                  data=kt).fit(), kidney_lm)
.....:
df_resid        ssr  df_diff   ss_diff        F    Pr(>F)
0        56  29.624856        0       NaN      NaN       NaN
1        54  28.989198        2  0.635658  0.59204  0.556748

In [136]: print anova_lm(ols('np.log(Days+1) ~ C(Duration)', data=kt).fit(),
.....:                 ols('np.log(Days+1) ~ C(Duration) + C(Weight, Sum)',
.....:                     data=kt).fit())
.....:
df_resid        ssr  df_diff    ss_diff          F    Pr(>F)
0        58  46.596147        0        NaN        NaN       NaN
1        56  29.624856        2  16.971291  16.040454  0.000003

In [137]: print anova_lm(ols('np.log(Days+1) ~ C(Weight)', data=kt).fit(),
.....:                 ols('np.log(Days+1) ~ C(Duration) + C(Weight, Sum)',
.....:                     data=kt).fit())
.....:
df_resid        ssr  df_diff   ss_diff         F   Pr(>F)
0        57  31.964549        0       NaN       NaN      NaN
1        56  29.624856        1  2.339693  4.422732  0.03997
```

## Sum of squares¶

Illustrates the use of different types of sums of squares (I,II,II) and how the Sum contrast can be used to produce the same output between the 3.

Types I and II are equivalent under a balanced design.

Don’t use Type III with non-orthogonal contrast - ie., Treatment

```In [138]: sum_lm = ols('np.log(Days+1) ~ C(Duration, Sum) * C(Weight, Sum)',
.....:               data=kt).fit()
.....:

In [139]: print anova_lm(sum_lm)
df     sum_sq   mean_sq          F    PR(>F)
C(Duration, Sum)                  1   2.339693  2.339693   4.358293  0.041562
C(Weight, Sum)                    2  16.971291  8.485645  15.806745  0.000004
C(Duration, Sum):C(Weight, Sum)   2   0.635658  0.317829   0.592040  0.556748
Residual                         54  28.989198  0.536837        NaN       NaN

In [140]: print anova_lm(sum_lm, typ=2)
sum_sq  df          F    PR(>F)
C(Duration, Sum)                  2.339693   1   4.358293  0.041562
C(Weight, Sum)                   16.971291   2  15.806745  0.000004
C(Duration, Sum):C(Weight, Sum)   0.635658   2   0.592040  0.556748
Residual                         28.989198  54        NaN       NaN

In [141]: print anova_lm(sum_lm, typ=3)
sum_sq  df           F        PR(>F)
Intercept                        156.301830   1  291.153237  2.077589e-23
C(Duration, Sum)                   2.339693   1    4.358293  4.156170e-02
C(Weight, Sum)                    16.971291   2   15.806745  3.944502e-06
C(Duration, Sum):C(Weight, Sum)    0.635658   2    0.592040  5.567479e-01
Residual                          28.989198  54         NaN           NaN

In [142]: nosum_lm = ols('np.log(Days+1) ~ C(Duration, Treatment) * C(Weight, Treatment)',
.....:              data=kt).fit()
.....:

In [143]: print anova_lm(nosum_lm)
df     sum_sq   mean_sq          F    PR(>F)
C(Duration, Treatment)                        1   2.339693  2.339693   4.358293  0.041562
C(Weight, Treatment)                          2  16.971291  8.485645  15.806745  0.000004
C(Duration, Treatment):C(Weight, Treatment)   2   0.635658  0.317829   0.592040  0.556748
Residual                                     54  28.989198  0.536837        NaN       NaN

In [144]: print anova_lm(nosum_lm, typ=2)
sum_sq  df          F    PR(>F)
C(Duration, Treatment)                        2.339693   1   4.358293  0.041562
C(Weight, Treatment)                         16.971291   2  15.806745  0.000004
C(Duration, Treatment):C(Weight, Treatment)   0.635658   2   0.592040  0.556748
Residual                                     28.989198  54        NaN       NaN

In [145]: print anova_lm(nosum_lm, typ=3)
sum_sq  df          F    PR(>F)
Intercept                                    10.427596   1  19.424139  0.000050
C(Duration, Treatment)                        0.054293   1   0.101134  0.751699
C(Weight, Treatment)                         11.703387   2  10.900317  0.000106
C(Duration, Treatment):C(Weight, Treatment)   0.635658   2   0.592040  0.556748
Residual                                     28.989198  54        NaN       NaN
```