Statsmodels 0.6.0 is another large release. It is the result of the work of 37 authors over the last year and includes over 1500 commits. It contains many new features, improvements, and bug fixes detailed below.
See the list of fixed issues for specific closed issues.
The following major new features appear in this version.
Generalized Estimating Equations (GEE) provide an approach to handling dependent data in a regression analysis. Dependent data arise commonly in practice, such as in a longitudinal study where repeated observations are collected on subjects. GEE can be viewed as an extension of the generalized linear modeling (GLM) framework to the dependent data setting. The familiar GLM families such as the Gaussian, Poisson, and logistic families can be used to accommodate dependent variables with various distributions.
Here is an example of GEE Poisson regression in a data set with four count-type repeated measures per subject, and three explanatory covariates.
    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Epilepsy data from the R package MASS (fetched via get_rdataset)
    data = sm.datasets.get_rdataset("epil", "MASS").data

    md = smf.gee("y ~ age + trt + base", "subject", data,
                 cov_struct=sm.cov_struct.Independence(),
                 family=sm.families.Poisson())
    mdf = md.fit()
    print(mdf.summary())
The dependence structure in a GEE is treated as a nuisance parameter and is modeled in terms of a “working dependence structure”. The statsmodels GEE implementation currently includes five working dependence structures (independent, exchangeable, autoregressive, nested, and a global odds ratio for working with categorical data). Since the GEE estimates are not maximum likelihood estimates, alternative approaches to some common inference procedures have been developed. The statsmodels GEE implementation currently provides standard errors, Wald tests, score tests for arbitrary parameter contrasts, and estimates and tests for marginal effects. Several forms of standard errors are provided, including robust standard errors that are approximately correct even if the working dependence structure is misspecified.
New functionality has been added for examining seasonality in plots. The two new functions are sm.graphics.tsa.month_plot and sm.graphics.tsa.quarter_plot. A third function, sm.graphics.tsa.seasonal_plot, is available for power users.
    import statsmodels.api as sm
    import pandas as pd

    dta = sm.datasets.elnino.load_pandas().data
    dta['YEAR'] = dta.YEAR.astype(int).astype(str)
    dta = dta.set_index('YEAR').T.unstack()
    # build one date per (year, month) label in the index
    dates = [pd.datetools.parse('1 ' + ' '.join(x)) for x in dta.index.values]
    dta.index = pd.DatetimeIndex(dates, freq='M')
    fig = sm.graphics.tsa.month_plot(dta)
We added a naive seasonal decomposition tool in the same vein as R’s decompose. This function can be found as sm.tsa.seasonal_decompose.
    import statsmodels.api as sm

    dta = sm.datasets.co2.load_pandas().data
    # deal with missing values. see issue
    dta.co2.interpolate(inplace=True)
    res = sm.tsa.seasonal_decompose(dta.co2)
    res.plot()
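To make the idea of a naive decomposition concrete, here is a minimal NumPy sketch of the classical additive moving-average approach used by R's decompose. It is not the statsmodels implementation; the function name naive_decompose and the synthetic series are assumptions made for illustration only.

```python
import numpy as np

def naive_decompose(x, period):
    """Additive decomposition: x = trend + seasonal + resid (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Centered moving average; for an even period the two end points
    # get half weight so the window stays symmetric.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # the edges have no full window
    trend[half:n - half] = np.convolve(x, weights, mode="valid")
    # Average the detrended values at each position in the cycle,
    # then center the seasonal component so it sums to zero.
    detrended = x - trend
    means = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    means -= means.mean()
    seasonal = np.tile(means, n // period + 1)[:n]
    resid = x - trend - seasonal
    return trend, seasonal, resid

# Synthetic monthly series: linear trend plus a fixed 12-month pattern.
t = np.arange(120)
pattern = np.sin(2 * np.pi * np.arange(12) / 12)
series = 0.1 * t + np.tile(pattern, 10)
trend, seasonal, resid = naive_decompose(series, 12)
print(seasonal[:12])
```

Because the trend is a plain moving average, the first and last half-window of the trend and residual are undefined (NaN here), which is one reason the method is called naive.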
Addition of Linear Mixed Effects Models (MixedLM)
Linear Mixed Effects models are used for regression analyses involving dependent data. Such data arise when working with longitudinal and other study designs in which multiple observations are made on each subject. Two specific mixed effects models are “random intercepts models”, where all responses in a single group are additively shifted by a value that is specific to the group, and “random slopes models”, where the values follow a mean trajectory that is linear in observed covariates, with both the slopes and intercept being specific to the group. The Statsmodels MixedLM implementation allows arbitrary random effects design matrices to be specified for the groups, so these and other types of random effects models can all be fit.
Here is an example of fitting a random intercepts model to data from a longitudinal study:
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Pig growth data from the R package geepack
    data = sm.datasets.get_rdataset('dietox', 'geepack', cache=True).data

    # random intercept for each pig
    md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
    mdf = md.fit()
    print(mdf.summary())
The Statsmodels LME framework currently supports post-estimation inference via Wald tests and confidence intervals on the coefficients, profile likelihood analysis, likelihood ratio testing, and AIC. Some limitations of the current implementation are that it does not support more complex structure on the residual errors (they are always homoscedastic), and it does not support crossed random effects. We hope to implement these features for the next release.
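The random slopes model described above can be sketched in the same formula style by passing a random effects formula through the re_formula argument. The data below are simulated for illustration (the column names, group sizes, and true parameter values are assumptions), so the example does not depend on an external data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated longitudinal data (an assumption for this sketch):
# 100 groups, 8 time points each, with a group-specific intercept
# and a group-specific slope on time.
rng = np.random.RandomState(0)
n_groups, n_obs = 100, 8
group = np.repeat(np.arange(n_groups), n_obs)
time = np.tile(np.arange(n_obs, dtype=float), n_groups)
intercepts = np.repeat(rng.normal(scale=1.0, size=n_groups), n_obs)
slopes = np.repeat(rng.normal(scale=0.5, size=n_groups), n_obs)
noise = rng.normal(scale=1.0, size=n_groups * n_obs)
y = (2.0 + intercepts) + (1.5 + slopes) * time + noise
sim = pd.DataFrame({"y": y, "time": time, "group": group})

# re_formula="~time" requests a random intercept and a random slope
# on time for every group.
md = smf.mixedlm("y ~ time", sim, groups=sim["group"], re_formula="~time")
mdf = md.fit()

print(mdf.fe_params)  # estimated mean intercept and mean slope
print(mdf.cov_re)     # 2 x 2 covariance of the random intercept and slope
```

Dropping the re_formula argument gives the random intercepts model from the earlier example.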
It is now possible to call out to X-12-ARIMA or X-13ARIMA-SEATS from statsmodels. These libraries must be installed separately.
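    import statsmodels.api as sm

    dta = sm.datasets.co2.load_pandas().data
    dta.co2.interpolate(inplace=True)
    # aggregate to monthly frequency; pandas 0.18 and later
    # needs dta.resample('M').mean() here
    dta = dta.resample('M')

    # automatic model order selection
    res = sm.tsa.x13_arima_select_order(dta.co2)
    print(res.order, res.sorder)

    # full seasonal adjustment analysis
    results = sm.tsa.x13_arima_analysis(dta.co2)
    fig = results.plot()
    fig.set_size_inches(12, 5)
    fig.tight_layout()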
The previous version (0.5.0) was released August 14, 2013. Since then we have closed a total of 528 issues: 276 pull requests and 252 regular issues. Refer to the detailed list for more information.
This release is a result of the work of the following 37 authors who contributed a total of 1531 commits. If for any reason we have failed to list your name below, please contact us:
Obtained by running git log v0.5.0..HEAD --format='* %aN <%aE>' | sed 's/@/\-at\-/' | sed 's/<>//' | sort -u.