1

Here's my sample program:

import numpy as np
import pandas as pd
import statsmodels
from statsmodels.formula.api import ols
df = pd.DataFrame({"z": [1,1,1,2,2,2,3,3,3],
                   "x":[0,1,2,0,1,2,0,1,2],
                   "y":[0,2,4,3,5,7,7,9,11]
                   })
model = ols("y ~ x + z + I(z**2)", df).fit()
model.params

newdf = pd.DataFrame({"z": [4,4,4,5,5,5],
                   "x":[0,1,2,0,1,2]
                   })
model.predict(newdf)

You'll notice, if you run this, that model.params is a pandas Series with indices the same as the right-hand side of the formula, except with an additional entry: "Intercept"

>  Out[2]: 
>     Intercept   -2.0
>     x            2.0
>     z            1.5
>     I(z ** 2)    0.5
>     dtype: float64

And, using some internal functionality I can't determine, the RegressionResults object's .predict() can recognize the column headers from newdf, match them up (including the patsy syntax "I(z**2)"), add the intercept, and return an answer Series. (this is the last line of my sample code)

This seems convenient! Better than writing out my formula again in python/numpy code whenever I want to evaluate slight variations on it. I feel like there should be some way for me to construct a similar pd.Series for formula coefficients, instead of having created it through a model and fit. Then I should be able to apply this to an appropriate dataframe as a way of evaluating functions.

My attempts to figure out how statsmodel is doing this haven't worked, I haven't found anything obvious in the related function doc pages, in patsy, nor can I seem to enter this section of the source code while debugging. Anyone have any idea how to set this up?

4

1 回答 1

1

我最终拼凑出一种方法来做到这一点。

def predict(coeffs,datadf:pd.DataFrame)->np.array:
    """Apply a series (or df) of coefficents indexed by model terms to new data

    :param coeffs: a series whose elements are coefficients and index are the formula terms
                or a df whose column names are formula terms, and each row is a set of coefficients
    :param datadf: the new data to predict on
    """
    desc = patsy.ModelDesc([],[patsy.Term([]) if column=="Intercept" else patsy.Term([patsy.EvalFactor(column)]) for column in coeffs.index] )

    dmat = patsy.dmatrix(desc,datadf)
    return np.dot(dmat, coeffs.T)

newdf["y"] = predict(model.params,newdf)

如果有人感到困惑,这对我来说似乎如此吸引人的原因是我正在使用df.groupby("column").apply(FitFunction). 似乎FitFunction()回归该model.params系列将是熊猫范式中最干净的方法。

于 2018-09-14T19:28:43.333 回答