
I am building a simple model pipeline with sklearn's pipeline module:

import pickle

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

class TextTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        return X[self.key]

X_train, X_test, y_train, y_test = train_test_split(df[['text']],
                                                    df['target'],
                                                    test_size=0.20)
feature_pipeline = Pipeline([
    ('transformer', TextTransformer(key='text')),
    ('tfidf', TfidfVectorizer(ngram_range=(1, 1))),
    ('svd', TruncatedSVD(algorithm='randomized', n_components=150))
])

pickle.dump(feature_pipeline, open(../".pkl", 'wb'))

After initializing the feature pipeline feature_pipeline, I want to reuse it from other scripts in pipelines for several models, for example (using SVC):

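from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

svc_pipeline = Pipeline([('features', feature_pipeline),
                         ('SVC', SVC())])
# Distributions sampled by the randomized search; gamma is ignored for the linear kernel.
parameter_grid = {'SVC__kernel': ['linear', 'rbf'],
                  'SVC__C': loguniform(1e-1, 1e2),
                  'SVC__gamma': loguniform(1e-3, 1e0)}
svc_pipeline = RandomizedSearchCV(svc_pipeline, parameter_grid, n_iter=10, cv=5, n_jobs=-1)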

svc_pipeline.fit(X_train, y_train)
predictions = svc_pipeline.predict(X_test)

So far I have initialized feature_pipeline in the same way for each model, and I assume it has to be initialized separately for each model?!
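To make the "one feature pipeline per model" idea concrete, here is a minimal sketch of one possible way to get an independent, unfitted copy for each model with sklearn.base.clone instead of retyping the definition. The helper make_feature_pipeline, the toy texts and labels, and n_components=2 are assumptions for illustration only, not part of my real setup:

```python
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def make_feature_pipeline():
    # Hypothetical helper: defines the shared feature steps in one place.
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
        ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ])


base_features = make_feature_pipeline()

# clone() returns a new unfitted estimator with the same parameters,
# so each model pipeline owns its own copy of the feature steps.
svc_pipeline = Pipeline([("features", clone(base_features)), ("SVC", SVC())])
logreg_pipeline = Pipeline([("features", clone(base_features)),
                            ("logreg", LogisticRegression())])

# Toy data standing in for df['text'] and df['target'] (assumed values).
texts = ["the cat sat on the mat", "dogs chase cats",
         "birds fly south in winter", "fish swim in the sea"]
labels = [0, 0, 1, 1]

svc_pipeline.fit(texts, labels)
logreg_pipeline.fit(texts, labels)

svc_predictions = svc_pipeline.predict(texts)
```

With this layout the template base_features itself is never fitted; only the clones inside each model pipeline are.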

As a workaround, I currently fit and pickle feature_pipeline, then transform and store the X_train and X_test matrices, for example:

feature_pipeline.fit(X_train)
pickle.dump(feature_pipeline, open(../".pkl", 'wb'))

X_train_transformed = feature_pipeline.transform(X_train)
pickle.dump(X_train_transformed, open(../".pkl", 'wb'))

X_test_transformed = feature_pipeline.transform(X_test)
pickle.dump(X_test_transformed, open(../".pkl", 'wb'))
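For reference, a minimal sketch of the round trip this workaround relies on: fit the feature pipeline, pickle it, load it back, and check that the loaded copy transforms new text the same way. The toy corpus, the temporary file name feature_pipeline.pkl, and n_components=2 are assumptions for illustration; the custom TextTransformer step is left out to keep the sketch self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real text column (assumed values).
train_texts = ["the cat sat on the mat", "dogs chase cats",
               "birds fly south in winter", "fish swim in the sea"]
test_texts = ["cats and dogs", "birds and fish"]

feature_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])
feature_pipeline.fit(train_texts)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a throwaway directory.
    path = os.path.join(tmp_dir, "feature_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(feature_pipeline, f)
    with open(path, "rb") as f:
        loaded_pipeline = pickle.load(f)

# The loaded pipeline carries the fitted vocabulary and SVD components,
# so it can transform unseen text without being refitted.
original = feature_pipeline.transform(test_texts)
restored = loaded_pipeline.transform(test_texts)
same_output = np.allclose(original, restored)
```

One caveat when doing this with a custom step such as TextTransformer: pickle stores classes by reference, so the class has to be importable under the same module path in the script that loads the file.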


Is there a way to save feature_pipeline directly so that it can be used without this workaround? Or is there some other best practice?
