How do I perform stepwise regression in Python? SciPy has a method for OLS, but I can't do it stepwise. Any help here would be greatly appreciated. Thanks.
EDIT: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, my aim is to select variables so that my model has the lowest p-value. The following link explains the goal:
Thanks again.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
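For illustration, here is a minimal sketch (not the function from the link; the function name and threshold are placeholders) of forward selection that, at each step, adds the candidate whose coefficient p-value is smallest and stops when no remaining candidate is significant. It assumes X is a pandas DataFrame of predictors and y is the response:

import pandas as pd
import statsmodels.api as sm

def forward_select_by_pvalue(X, y, threshold=0.05):
    selected = []
    remaining = list(X.columns)
    while remaining:
        best_p, best_var = None, None
        for var in remaining:
            exog = sm.add_constant(X[selected + [var]])
            p = sm.OLS(y, exog).fit().pvalues[var]  # p-value of the candidate's coefficient
            if best_p is None or p < best_p:
                best_p, best_var = p, var
        if best_p is not None and best_p < threshold:
            selected.append(best_var)      # keep the most significant candidate
            remaining.remove(best_var)
        else:
            break                          # no remaining candidate is significant
    return selected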
You could try mlxtend, which has a variety of selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression  # estimator used inside the selector

clf = LinearRegression()

# Build the step-forward feature selector
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform sequential forward selection on the training data
sfs1 = sfs1.fit(X_train, y_train)
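After fitting, the selector exposes which columns were kept and the cross-validated score of that subset (attribute names as documented for mlxtend's SequentialFeatureSelector):

print(sfs1.k_feature_idx_)  # indices of the selected features
print(sfs1.k_score_)        # cross-validated score of the selected subset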
You can make forward-backward selection based on a statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
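For reference, a rough sketch of what such a forward-backward p-value procedure can look like (this is an illustration, not the code from the linked answer; X is assumed to be a pandas DataFrame and the thresholds are arbitrary):

import pandas as pd
import statsmodels.api as sm

def stepwise_pvalue_selection(X, y, threshold_in=0.05, threshold_out=0.10):
    included = []
    while True:
        changed = False
        # forward step: add the most significant excluded variable
        excluded = [c for c in X.columns if c not in included]
        new_pvals = pd.Series(dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        if not new_pvals.empty and new_pvals.min() < threshold_in:
            included.append(new_pvals.idxmin())
            changed = True
        # backward step: drop the least significant included variable
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvals = model.pvalues.drop('const')
            if pvals.max() > threshold_out:
                included.remove(pvals.idxmax())
                changed = True
        if not changed:
            break
    return included

Keeping threshold_in below threshold_out avoids a variable being added and removed in an endless cycle.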
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
I developed this repository: https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use them with Pipeline and GridSearchCV.
The essential parts of my code are as follows:
################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit a model on the given feature subset and compute its fit statistics
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))  # myBic is a helper defined elsewhere in the class
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq, "predictors_index": feature_index}

################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq (or, alternatively, the lowest bic)
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []
    M = min(fK, X_est.shape[1])
    for i in range(1, M + 1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']
    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(), 'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(), 'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check the p-values of your variables in the kernel output, listed as 'P>|t|'. Then look for the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove that column from the array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these steps until you have removed every column whose p-value is above the significance level (e.g. 0.05). In the end, your variable X_opt will hold all the best variables, i.e. those with p-values below the significance level.
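To avoid repeating these steps by hand, the elimination loop can be automated. A minimal sketch, assuming as above that X is a NumPy array whose first column is the column of ones for the intercept and y is the response vector (the function name and significance level are placeholders):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    cols = list(range(X.shape[1]))         # indices of the columns still in the model
    model = None
    while cols:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
        if model.pvalues.max() > sl:
            worst = cols[int(np.argmax(model.pvalues))]
            cols.remove(worst)             # drop the column with the highest p-value
        else:
            break
    return X[:, cols], model

Note that this sketch treats the intercept column like any other column; you may prefer to exclude it from removal.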
Here is a method I just wrote that uses the "mixed selection" described in An Introduction to Statistical Learning. As input, it takes:
lm, a statsmodels OLS fit (statsmodels.OLS(Y, X).fit()), where X is an array with n rows (n being the number of data points) and Y is the response in the training data
curr_preds - a list containing ['const']
potential_preds - a list of all the potential predictors. It also needs a pandas DataFrame X_mix that holds all of the data, including 'const', and all of the data corresponding to the potential predictors
tol, optional. The maximum p-value, .05 if not specified
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    while len(potential_preds) > 0:
        index_best = -1  # this will record the index of the best predictor
        curr = -1  # this will record the current index
        best_r_squared = lm.rsquared_adj  # record the adjusted r-squared of the current model
        # loop to determine if any of the predictors can better the r-squared
        for pred in potential_preds:
            curr += 1  # increment current
            preds = curr_preds.copy()  # grab the current predictors
            preds.append(pred)
            lm_new = sm.OLS(y, X_mix[preds]).fit()  # model with the current predictors plus one additional potential predictor
            new_r_sq = lm_new.rsquared_adj  # record the adjusted r-squared of the new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:  # a potential predictor improved the r-squared; move it from potential_preds to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:  # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # fit a new lm using the new predictors, so the next iteration compares against it, and look at the p-values
        lm = sm.OLS(y, X_mix[curr_preds]).fit()
        pvals = lm.pvalues
        pval_too_big = []
        # make a list of all the p-values that are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':  # if the p-value is too large, add it to the list of big p-values
                pval_too_big.append(feat)
        # now remove all the features from curr_preds whose p-value is too large
        for feat in pval_too_big:
            pop_index = curr_preds.index(feat)
            curr_preds.pop(pop_index)
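A hypothetical usage sketch (the data, column names, and the intercept-only starting model are made up for illustration; as described above, the function reads y and X_mix from the surrounding scope and modifies curr_preds in place):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X_mix = pd.DataFrame(rng.normal(size=(100, 3)), columns=['x1', 'x2', 'x3'])
X_mix['const'] = 1.0
y = 2 * X_mix['x1'] - X_mix['x2'] + rng.normal(size=100)

curr_preds = ['const']
potential_preds = ['x1', 'x2', 'x3']
lm = sm.OLS(y, X_mix[curr_preds]).fit()   # start from the intercept-only model
mixed_selection(lm, curr_preds, potential_preds)
print(curr_preds)                          # predictors retained by mixed selection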