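I am building a stacked regression model using sklearn and mlxtend.regressor.StackingRegressor. For example, say I want the following small pipeline:

  1. A stacking regressor with two regressors:
    • A pipeline which:
      • Performs data imputation
      • 1-hot encodes categorical features
      • Performs linear regression
    • A pipeline which:
      • Performs data imputation
      • Performs regression using a decision tree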

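Unfortunately, this is not possible, because StackingRegressor does not accept NaN in its input data. This holds even if its regressors know how to handle NaN, as in my case, where the regressors are actually pipelines that perform data imputation.

However, this is not a problem: I can move data imputation outside the stacking regressor. Now my pipeline looks like this:

  1. Perform data imputation
  2. Apply a stacking regressor with two regressors:
    • A pipeline which:
      • 1-hot encodes categorical features
      • Standardises numerical features
      • Performs linear regression
    • A sklearn.tree.DecisionTreeRegressor

One can try to implement it as follows (the entire minimal working example is in this gist, with comments):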

sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
             make_pipeline(OneHotEncoder(), StandardScaler()),
             make_column_selector(dtype_include='category')),
        ('numerical',
             StandardScaler(),
             make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

sr_tree = DecisionTreeRegressor()

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
        SimpleImputer(strategy='constant', fill_value='None'),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
    )),
    ('model', StackingRegressor(
        regressors=[sr_linear, sr_tree],
        meta_regressor=DecisionTreeRegressor(),
        use_features_in_secondary=True
    ))
])

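Note that the "outer" ColumnTransformer (in stacked_regressor) returns a numpy matrix. But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the matrix back into a data frame using the step back_to_pandas. (To use get_feature_names_out I had to use the nightly version of sklearn, because the current stable version 1.0.2 does not support it yet. Fortunately, it can be installed with one simple command.)

The above code fails when calling stacked_regressor.fit(), with the following error (the entire stack trace is again in the gist):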

ValueError: make_column_selector can only be applied to pandas dataframes

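However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame! In fact, if I just fit_transform() my ct_imputation object, I clearly obtain a pandas data frame. I cannot understand where and when exactly the data being passed around ceases to be a data frame. Why is my code failing?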

2 Answers

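IMO the issue has to be ascribed to StackingRegressor. Admittedly, I am not an expert on its usage and I have not yet explored its source code, but I found this sklearn issue - #16473, which seems to imply that << the concatenation of [regressors and meta_regressors] does not preserve dataframe >> (though it refers to the sklearn StackingRegressor, not the mlxtend one).

Indeed, have a look at what happens once you replace it with your sr_linear pipeline: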

from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

from mlxtend.regressor import StackingRegressor

import numpy as np
import pandas as pd

# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame

# Small data preprocessing:
for column in d.columns:
    if d[column].dtype == object or column == 'MSSubClass':
        d[column] = pd.Categorical(d[column])
    
d.drop(columns='Id', inplace=True)

# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]

# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
             make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
             make_column_selector(dtype_include='category')),
        ('numerical',
             StandardScaler(),
             make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
        SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    # NOTE: the `types` dictionary must be defined before fit() is called
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
    )),
    ('mdl', sr_linear)
])

stacked_regressor.fit(X_train, y_train)

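Note that I had to slightly modify the 'back_to_pandas' step because, for some reason, pd.DataFrame was changing the dtypes of the columns to 'object' only (from 'category' and 'float64'), which clashed with the preprocessing done in sr_linear. For this reason, I applied .astype(types) to the pd.DataFrame constructor, where types is defined as follows (based on the implementation of the .get_feature_names_out() method of SimpleImputer in the dev version of sklearn):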

types = {} 
for col in d.columns[:-1]: 
    if d[col].dtype == 'category':
        types['categorical__' + col] = str(d[col].dtype)
    else:
        types['numerical__' + col] = str(d[col].dtype)
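
The dtype loss and its repair can be seen on a tiny hand-made example. This is only a sketch: the array `values`, the column names and the `toy_types` mapping are made up to mimic the imputer output, and pandas' select_dtypes stands in for the dtype-based column selectors.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imputer output: a single object-dtype array
values = np.array([["red", 1.0], ["None", 2.5], ["blue", 4.0]], dtype=object)
columns = ["categorical__color", "numerical__size"]

raw = pd.DataFrame(values, columns=columns)
# The numeric column is stored as generic Python objects, so a
# dtype-based selection no longer finds any numeric column
numeric_before = raw.select_dtypes(include=np.number).shape[1]

# Same idea as the `types` dictionary above, written out by hand
toy_types = {"categorical__color": "category", "numerical__size": "float64"}
restored = raw.astype(toy_types)
numeric_after = restored.select_dtypes(include=np.number).shape[1]
categorical_after = restored.select_dtypes(include="category").shape[1]

print(raw.dtypes)
print(restored.dtypes)
```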
answered 2022-02-18T21:24:46.630

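The correct thing to do was:

  1. Move from mlxtend's to sklearn's StackingRegressor. I believe the former was created when sklearn still did not have a stacking regressor. Now there is no need to use more "obscure" solutions. sklearn's stacking regressor works pretty well.
  2. Move the 1-hot-encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.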
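
The second point can be checked on a toy example. This is a minimal sketch with made-up data; it uses a string-valued column rather than the Ames data, and only shows that the tree rejects non-numeric input until the feature is encoded.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a single string-valued feature
X = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "green"]})
y = [10.0, 20.0, 11.0, 30.0, 19.0, 31.0]

# Fitting the tree directly on string categories is rejected
try:
    DecisionTreeRegressor(random_state=0).fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# Encoding the categories first gives the tree purely numeric input
encoded_tree = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeRegressor(random_state=0),
)
encoded_tree.fit(X, y)
predictions = encoded_tree.predict(X)

print(raw_fit_failed)
print(predictions)
```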

A working version of the full pipeline is given below:

from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor

import numpy as np
import pandas as pd

def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
    for column in df.columns:
        if df[column].dtype == object or 'MSSubClass' in column:
            df[column] = pd.Categorical(df[column])

    return df

d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')

sr_linear = Pipeline(steps=[
    ('preprocessing', StandardScaler()),
    ('model', LinearRegression())
])

ct_preprocessing = ColumnTransformer(transformers=[
    ('categorical',
        make_pipeline(
            SimpleImputer(strategy='constant', fill_value='None'),
            # On scikit-learn >= 1.2 this argument is called sparse_output
            OneHotEncoder(sparse=False, handle_unknown='ignore')
        ),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number))
])

stacking_regressor = Pipeline(steps=[
    ('preprocessing', ct_preprocessing),
    ('model', StackingRegressor(
        estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
        final_estimator=DecisionTreeRegressor(),
        passthrough=True
    ))
])

label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

stacking_regressor.fit(X_train, y_train)
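
As far as I can tell, passthrough=True plays the role that use_features_in_secondary=True had in mlxtend: the final estimator sees the original features in addition to the base predictions. A small sketch on synthetic data (the helper `build_stack` and the data are made up for illustration) makes the difference visible by counting the columns the final estimator was fitted on.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with 4 numeric features (purely illustrative)
X, y = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)

def build_stack(passthrough):
    """Hypothetical helper: same estimators as above, passthrough toggled."""
    return StackingRegressor(
        estimators=[
            ("linear_regression", LinearRegression()),
            ("regression_tree", DecisionTreeRegressor(random_state=0)),
        ],
        final_estimator=DecisionTreeRegressor(random_state=0),
        passthrough=passthrough,
    )

with_features = build_stack(passthrough=True).fit(X, y)
without_features = build_stack(passthrough=False).fit(X, y)

# Number of columns seen by the final estimator in each case
n_with = with_features.final_estimator_.n_features_in_
n_without = without_features.final_estimator_.n_features_in_
print(n_with, n_without)
```

With two base estimators and four input features, the final estimator is trained on the two prediction columns alone when passthrough is off, and on those plus the four original features when it is on.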

Thanks to user amiola, whose answer put me on the right track.

answered 2022-02-19T19:40:43.263