
Prediction (out of sample)


In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     995.7
Date:                Mon, 17 Nov 2014   Prob (F-statistic):           7.90e-42
Time:                        13:07:09   Log-Likelihood:                 2.2733
No. Observations:                  50   AIC:                             3.453
Df Residuals:                      46   BIC:                             11.10
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.0036      0.082     60.899      0.000         4.838     5.169
x1             0.4976      0.013     39.267      0.000         0.472     0.523
x2             0.5335      0.050     10.710      0.000         0.433     0.634
x3            -0.0193      0.001    -17.343      0.000        -0.022    -0.017
==============================================================================
Omnibus:                        0.178   Durbin-Watson:                   2.742
Prob(Omnibus):                  0.915   Jarque-Bera (JB):                0.378
Skew:                          -0.063   Prob(JB):                        0.828
Kurtosis:                       2.593   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.52122651   5.01161382   5.46078176   5.83965616   6.12965554
   6.32574407   6.43725885   6.48637592   6.50446674   6.52694402
   6.58744345   6.71229747   6.91620893   7.19983547   7.54968142
   7.94031517   8.33854745   8.70888426   9.01935844   9.24678178
   9.38055475   9.42440803   9.3957895    9.32299766   9.24053225
   9.18342602   9.18148897   9.25441064   9.40852518   9.63577068
   9.91501369  10.21552141  10.50201098  10.74044639  10.90363045
  10.97567327  10.95460286  10.85268919  10.6944282   10.51251724
  10.34248297  10.21684441  10.1597702   10.18310773  10.28443663
  10.44746648  10.64471325  10.84201485  11.00414385  11.10059529]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 11.09987108  10.95929855  10.69979877  10.36904817  10.02980572
   9.74454733   9.56016957   9.49650776   9.54147958   9.65404327]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula-expansion magic from using **2.
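
A minimal sketch using patsy (the formula engine behind statsmodels) to show what I() does: the wrapped expression is evaluated as ordinary Python and becomes a single literal column in the design matrix.

```python
import numpy as np
import patsy  # patsy is the formula backend used by statsmodels

x1 = np.array([1.0, 2.0, 3.0])
# I((x1-5)**2) evaluates the expression as plain arithmetic,
# producing one column with the literal value (x1 - 5)**2.
dm = patsy.dmatrix("I((x1-5)**2)", {"x1": x1})
assert np.allclose(np.asarray(dm)[:, 1], (x1 - 5)**2)
```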

In [8]:
res.params
Out[8]:
Intercept           5.003625
x1                  0.497576
np.sin(x1)          0.533480
I((x1 - 5) ** 2)   -0.019296
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 11.09987108,  10.95929855,  10.69979877,  10.36904817,
        10.02980572,   9.74454733,   9.56016957,   9.49650776,
         9.54147958,   9.65404327])
