r - OLS in Python with Dummy Variables - Best Solution?

Question

I have a problem I am trying to solve in Python, and I have found multiple solutions (I think), but I am trying to figure out which one is best. I am hoping to choose libraries that will be fully supported in the future so I do not have to rewrite this service.

I want to do an ordinary multivariate least squares regression with both categorical and continuous independent variables. The code has to be written in Python, as it is being integrated into a web service. I have been following pandas quite a bit but have never used it, so this seems to be one approach:

SOLUTION 1. https://github.com/pydata/pandas/blob/master/examples/regressions.py

Obviously, numpy/scipy are ideal, but I can't find an example that uses dummy variables (does anyone have one?). I did find this, though:

SOLUTION 2. http://www.scipy.org/Cookbook/OLS

which I could modify to support dummy variables, but I do not want to do that if someone else has already done it. I also want the numbers to be very close to R's, as I have done most of my analysis offline in R and can use those results for unit tests.

In example (2) above, I see that I could technically use rpy/rpy2, although that is not optimal because my web service would then require yet another piece of technology (R). The good thing about using that interface is that the numbers would be identical to my results from R.

SOLUTION 3. http://www.scipy.org/Cookbook/OLS (but using Rpy/Rpy2)

Anyway, I am interested in which of these three solutions everyone would choose, whether there are any I am missing, and whether pandas is mature enough to start using in a production web service. The key thing here is that I do not want to have to support or patch bug fixes, or write anything from scratch, if possible. I'm too busy and probably not smart enough :)

Thanks.

score 6 · Accepted Answer

You can use statsmodels, which has many different models and result statistics.

If you want to use an R-like formula interface, here are some examples; you can also look at the corresponding documentation:

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/contrasts.html
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/example_formulas.html

If you want a pure numpy version, here is an old example that does everything from scratch: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html#ols-with-dummy-variables
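To give a rough idea of what a from-scratch version involves, here is a minimal sketch using only numpy. It is not taken from the linked example. The toy data, the variable names (`x`, `group`, `y`), and the choice of the first level as the reference category are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: one continuous regressor and one categorical regressor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
group = np.array(["a", "b", "c", "a", "b", "c"])

# Noise-free response built from known coefficients, so the fit is easy to check:
# intercept 1, slope 2 on x, +3 for level "b", -1 for level "c" (level "a" is the baseline).
y = 1.0 + 2.0 * x + 3.0 * (group == "b") - 1.0 * (group == "c")

# Build dummy (indicator) columns, dropping the first level to avoid
# perfect collinearity with the intercept column.
levels = np.unique(group)  # sorted unique levels: a, b, c
dummies = (group[:, None] == levels[None, 1:]).astype(float)

# Design matrix: intercept, continuous regressor, then one column per non-reference level.
X = np.column_stack([np.ones_like(x), x, dummies])

# Ordinary least squares via numpy's least-squares solver.
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```

Dropping one level per categorical variable is the same treatment coding that R uses by default for factors, so the coefficients should be directly comparable to `lm` output for the same reference level.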

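The models are integrated with pandas and can use pandas DataFrames as the data structure for the dependent and independent variables (endog and exog in the statsmodels naming convention).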
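As a minimal sketch of how the formula interface and the pandas integration fit together, the following fits an OLS model with one continuous and one categorical regressor. The DataFrame contents and column names (`x`, `group`, `y`) are hypothetical. The `C(group)` term asks the formula machinery to treat the column as categorical and generate the dummy variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data held in a pandas DataFrame.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "group": ["a", "b", "c", "a", "b", "c"],
})

# Noise-free response with known coefficients, so the fit is easy to check:
# intercept 1, slope 2 on x, +3 for level "b", -1 for level "c".
df["y"] = 1.0 + 2.0 * df["x"] + 3.0 * (df["group"] == "b") - 1.0 * (df["group"] == "c")

# R-style formula; C(...) marks the column as categorical, and dummy
# columns are created automatically with the first level as the baseline.
model = smf.ols("y ~ x + C(group)", data=df)
result = model.fit()

# Fitted coefficients as a pandas Series indexed by term name.
params = result.params
print(params)
```

The roughly equivalent R call would be `lm(y ~ x + factor(group), data = df)`, which should make it straightforward to compare coefficients against offline R results in unit tests.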
