
Ordinary Least Squares

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: import matplotlib.pyplot as plt

In [4]: from statsmodels.sandbox.regression.predstd import wls_prediction_std

In [5]: np.random.seed(9876789)

OLS Estimation

Artificial data

In [6]: nsample = 100

In [7]: x = np.linspace(0, 10, 100)

In [8]: X = np.column_stack((x, x**2))

In [9]: beta = np.array([1, 0.1, 10])

In [10]: e = np.random.normal(size=nsample)

Our model needs an intercept so we add a column of 1s:

In [11]: X = sm.add_constant(X, prepend=False)

In [12]: y = np.dot(X, beta) + e

Inspect data

In [13]: print X[:5, :]
[[ 0.      0.      1.    ]
 [ 0.101   0.0102  1.    ]
 [ 0.202   0.0408  1.    ]
 [ 0.303   0.0918  1.    ]
 [ 0.404   0.1632  1.    ]]

In [14]: print y[:5]
[  9.1595  11.6995  10.6716   9.8041  13.3547]

Fit and summary

In [15]: model = sm.OLS(y, X)

In [16]: results = model.fit()

In [17]: print results.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.968
Model:                            OLS   Adj. R-squared:                  0.968
Method:                 Least Squares   F-statistic:                     1479.
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           2.19e-73
Time:                        17:19:39   Log-Likelihood:                -146.51
No. Observations:                 100   AIC:                             299.0
Df Residuals:                      97   BIC:                             306.8
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.8598      0.145      5.948      0.000         0.573     1.147
x2             0.1103      0.014      7.883      0.000         0.082     0.138
const         10.3423      0.313     33.072      0.000         9.722    10.963
==============================================================================
Omnibus:                        2.042   Durbin-Watson:                   2.274
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
Skew:                           0.234   Prob(JB):                        0.392
Kurtosis:                       2.519   Cond. No.                         144.
==============================================================================

Quantities of interest can be extracted directly from the fitted model. Type dir(results) for a full list. Here are some examples:

In [18]: print results.params
[  0.8598   0.1103  10.3423]

In [19]: print results.rsquared
0.968242518536
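
Other quantities are exposed as attributes and methods of the results object in the same way; for example (a minimal sketch, no output shown, attribute names as in current statsmodels releases):

print(results.bse)          # standard errors of the parameter estimates
print(results.conf_int())   # confidence intervals (95% by default)
print(results.fittedvalues[:5])
print(results.resid[:5])    # residuals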

OLS non-linear curve but linear in parameters

Artificial data

Non-linear relationship between x and y

In [20]: nsample = 50

In [21]: sig = 0.5

In [22]: x = np.linspace(0, 20, nsample)

In [23]: X = np.c_[x, np.sin(x), (x - 5)**2, np.ones(nsample)]

In [24]: beta = [0.5, 0.5, -0.02, 5.]

In [25]: y_true = np.dot(X, beta)

In [26]: y = y_true + sig * np.random.normal(size=nsample)

Fit and summary

In [27]: res = sm.OLS(y, X).fit()

In [28]: print res.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.933
Model:                            OLS   Adj. R-squared:                  0.928
Method:                 Least Squares   F-statistic:                     211.8
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           6.30e-27
Time:                        17:19:41   Log-Likelihood:                -34.438
No. Observations:                  50   AIC:                             76.88
Df Residuals:                      46   BIC:                             84.52
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.4687      0.026     17.751      0.000         0.416     0.522
x2             0.4836      0.104      4.659      0.000         0.275     0.693
x3            -0.0174      0.002     -7.507      0.000        -0.022    -0.013
const          5.2058      0.171     30.405      0.000         4.861     5.550
==============================================================================
Omnibus:                        0.655   Durbin-Watson:                   2.896
Prob(Omnibus):                  0.721   Jarque-Bera (JB):                0.360
Skew:                           0.207   Prob(JB):                        0.835
Kurtosis:                       3.026   Cond. No.                         221.
==============================================================================

Extract other quantities of interest

In [29]: print res.params
[ 0.4687  0.4836 -0.0174  5.2058]

In [30]: print res.bse
[ 0.0264  0.1038  0.0023  0.1712]

In [31]: print res.predict()
[  4.7707   5.2221   5.6362   5.9866   6.2564   6.4412   6.5493   6.6009
   6.6243   6.6518   6.7138   6.8341   7.0262   7.2905   7.6149   7.9763
   8.3446   8.6876   8.9764   9.19     9.3187   9.3659   9.3474   9.2889
   9.2217   9.1775   9.1834   9.2571   9.4044   9.6181   9.879   10.1591
  10.4266  10.6505  10.8063  10.8795  10.8683  10.7838  10.6483  10.4913
  10.3452  10.2393  10.1957  10.2249  10.3249  10.4808  10.6678  10.8549
  11.0101  11.1058]

Draw a plot to compare the true relationship to the OLS predictions. Prediction intervals around the predicted values are built using the wls_prediction_std function.

In [32]: plt.figure();

In [33]: plt.plot(x, y, 'o', x, y_true, 'b-');

In [34]: prstd, iv_l, iv_u = wls_prediction_std(res);

In [35]: plt.plot(x, res.fittedvalues, 'r--.');

In [36]: plt.plot(x, iv_u, 'r--');

In [37]: plt.plot(x, iv_l, 'r--');

In [38]: plt.title('blue: true,   red: OLS');
[Figure ols_predict_0.png: data points, true relationship in blue, OLS fit and prediction interval in red]
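
In more recent statsmodels releases the same kind of intervals can also be obtained from the results object itself via get_prediction; a minimal sketch, assuming such a release is installed:

pred = res.get_prediction()
frame = pred.summary_frame(alpha=0.05)
# the frame contains the predicted mean, its confidence interval
# (mean_ci_lower/upper) and the prediction interval (obs_ci_lower/upper)
print(frame.head())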

OLS with dummy variables

Artificial data

We create 3 groups which will be modelled using dummy variables. Group 0 is the omitted/benchmark category.

In [39]: nsample = 50

In [40]: groups = np.zeros(nsample, int)

In [41]: groups[20:40] = 1

In [42]: groups[40:] = 2

In [43]: dummy = (groups[:, None] == np.unique(groups)).astype(float)

In [44]: x = np.linspace(0, 20, nsample)

In [45]: X = np.c_[x, dummy[:, 1:], np.ones(nsample)]

In [46]: beta = [1., 3, -3, 10]

In [47]: y_true = np.dot(X, beta)

In [48]: e = np.random.normal(size=nsample)

In [49]: y = y_true + e

Inspect the data

In [50]: print X[:5, :]
[[ 0.      0.      0.      1.    ]
 [ 0.4082  0.      0.      1.    ]
 [ 0.8163  0.      0.      1.    ]
 [ 1.2245  0.      0.      1.    ]
 [ 1.6327  0.      0.      1.    ]]

In [51]: print y[:5]
[  9.2822  10.5048  11.8439  10.3851  12.3794]

In [52]: print groups
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 2 2 2 2 2 2 2 2 2]

In [53]: print dummy[:5, :]
[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]]
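
As an aside, the same dummy columns can be generated automatically by the formula interface through patsy's C() operator; a minimal sketch, assuming pandas is available (the frame name df is purely illustrative):

import pandas as pd
import statsmodels.formula.api as smf

# collect the arrays built above into a data frame
df = pd.DataFrame({"y": y, "x": x, "groups": groups})

# C(groups) expands into dummy variables, with group 0 as the reference category
res_formula = smf.ols("y ~ x + C(groups)", data=df).fit()
print(res_formula.params)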

Fit and summary

In [54]: res2 = sm.OLS(y, X).fit()

In [55]: print res2.summary()

In [56]: print res2.params
[  0.9999   2.8909  -3.2232  10.1031]

In [57]: print res2.bse
[ 0.0599  0.5689  0.927   0.3102]

In [58]: print res2.predict()

Draw a plot to compare the true relationship to OLS predictions.

In [59]: prstd, iv_l, iv_u = wls_prediction_std(res2);

In [60]: plt.figure();

In [61]: plt.plot(x, y, 'o', x, y_true, 'b-');

In [62]: plt.plot(x, res2.fittedvalues, 'r--.');

In [63]: plt.plot(x, iv_u, 'r--');

In [64]: plt.plot(x, iv_l, 'r--');

In [65]: plt.title('blue: true,   red: OLS');
[Figure ols_predict_1.png: data points, true relationship in blue, OLS fit and prediction interval in red]

Joint hypothesis tests

F test

We want to test the hypothesis that both coefficients on the dummy variables are equal to zero, that is, R × β = 0. An F test leads us to strongly reject the null hypothesis of identical constants across the three groups:

In [66]: R = [[0, 1, 0, 0], [0, 0, 1, 0]]

In [67]: print np.array(R)
[[0 1 0 0]
 [0 0 1 0]]

In [68]: print res2.f_test(R)
<F test: F=array([[ 145.4927]]), p=[[ 0.]], df_denom=46, df_num=2>
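
The restrictions can also be written as a string using the parameter names from the summary table, which f_test accepts in recent statsmodels releases; the following is equivalent to the matrix form above:

print(res2.f_test("x2 = 0, x3 = 0"))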

T test

We want to test the null hypothesis that the effects of the 2nd and 3rd groups are equal, that is, that their difference is zero. The t-test allows us to reject this null hypothesis (but note the one-sided p-value):

In [69]: R = [0, 1, -1, 0]

In [70]: print res2.t_test(R)
<T test: effect=array([ 6.114]), sd=array([[ 0.5111]]), t=array([[ 11.9616]]), p=array([[ 0.]]), df_denom=46>
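
As with f_test, the contrast can be spelled out as a string against the parameter names, equivalent to R = [0, 1, -1, 0] above (assuming a statsmodels release with string-hypothesis support):

print(res2.t_test("x2 - x3 = 0"))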

Small group effects

If we generate artificial data with smaller group effects, the test can no longer reject the null hypothesis (the F test on this single restriction is equivalent to a two-sided t-test):

In [71]: beta = [1., 0.3, -0.0, 10]

In [72]: y_true = np.dot(X, beta)

In [73]: y = y_true + np.random.normal(size=nsample)

In [74]: res3 = sm.OLS(y, X).fit()

In [75]: print res3.f_test(R)
<F test: F=array([[ 0.0151]]), p=[[ 0.9028]], df_denom=46, df_num=1>

Multicollinearity

Data

The Longley dataset is well known to have high multicollinearity; that is, the exogenous predictors are highly correlated. This is problematic because it can affect the stability of the coefficient estimates as we make minor changes to the model specification.

In [76]: from statsmodels.datasets.longley import load

In [77]: y = load().endog

In [78]: X = load().exog

In [79]: X = sm.tools.add_constant(X, prepend=False)

Fit and summary

In [80]: ols_model = sm.OLS(y, X)

In [81]: ols_results = ols_model.fit()

In [82]: print ols_results.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     330.3
Date:                Wed, 14 Aug 2013   Prob (F-statistic):           4.98e-10
Time:                        17:19:51   Log-Likelihood:                -109.62
No. Observations:                  16   AIC:                             233.2
Df Residuals:                       9   BIC:                             238.6
Df Model:                           6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            15.0619     84.915      0.177      0.863      -177.029   207.153
x2            -0.0358      0.033     -1.070      0.313        -0.112     0.040
x3            -2.0202      0.488     -4.136      0.003        -3.125    -0.915
x4            -1.0332      0.214     -4.822      0.001        -1.518    -0.549
x5            -0.0511      0.226     -0.226      0.826        -0.563     0.460
x6          1829.1515    455.478      4.016      0.003       798.788  2859.515
const      -3.482e+06    8.9e+05     -3.911      0.004      -5.5e+06 -1.47e+06
==============================================================================
Omnibus:                        0.749   Durbin-Watson:                   2.559
Prob(Omnibus):                  0.688   Jarque-Bera (JB):                0.684
Skew:                           0.420   Prob(JB):                        0.710
Kurtosis:                       2.434   Cond. No.                     4.86e+09
==============================================================================

Warnings:
[1] The condition number is large, 4.86e+09. This might indicate that there are
strong multicollinearity or other numerical problems.

Condition number

One way to assess multicollinearity is to compute the condition number. Values over 20 are worrisome (see Greene 4.9). The first step is to scale the independent variables (all columns except the constant) to unit length:

In [83]: norm_x = np.ones_like(X)

In [84]: for i in range(int(ols_model.df_model)):
   ....:      norm_x[:, i] = X[:, i] / np.linalg.norm(X[:, i])
   ....: 

In [85]: norm_xtx = np.dot(norm_x.T, norm_x)

Then we take the square root of the ratio of the largest to the smallest eigenvalue.

In [86]: eigs = np.linalg.eigvals(norm_xtx)

In [87]: condition_number = np.sqrt(eigs.max() / eigs.min())

In [88]: print condition_number
56240.8697597
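
Up to floating point differences, the same value can be read off the normalized design matrix directly with numpy's condition number routine, which works from the ratio of the largest to the smallest singular value:

# 2-norm condition number of norm_x equals sqrt(lambda_max / lambda_min) of norm_x' norm_x
print(np.linalg.cond(norm_x))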

Dropping an observation

Greene also points out that dropping a single observation can have a dramatic effect on the coefficient estimates:

In [89]: ols_results2 = sm.OLS(y[:-1], X[:-1, :]).fit()

In [90]: res_dropped = ols_results.params / ols_results2.params * 100 - 100

In [91]: print 'Percentage change %4.2f%%\n' * 7 % tuple(i for i in res_dropped)
Percentage change -173.43%
Percentage change 31.04%
Percentage change 3.48%
Percentage change 7.83%
Percentage change -199.54%
Percentage change 15.39%
Percentage change 15.40%
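
A natural extension of this check is to refit the model leaving out each observation in turn and look at how far the coefficients move; a rough sketch along the same lines (illustrative only, output not shown):

nobs = len(y)
for i in range(nobs):
    keep = np.arange(nobs) != i          # drop observation i
    params_i = sm.OLS(y[keep], X[keep]).fit().params
    # largest percentage change across all coefficients
    print(i, np.abs(params_i / ols_results.params * 100 - 100).max())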