python - What is the correct way to run vaex.ml.catboost.CatBoostModel.fit in parallel?

Question

Description

I have Python code that calls vaex.ml.catboost.CatBoostModel.fit sequentially for 3 folds. This takes a long time, so I would like to run vaex.ml.catboost.CatBoostModel.fit in parallel.

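Problem

When I run vaex.ml.catboost.CatBoostModel.fit sequentially and in parallel, I get different results. Clearly I am doing something wrong. I expect the parallel results to be very close to the sequential ones (the seed is not hard-coded, so there will always be some small fluctuation). Instead, the sequential and parallel versions produce results that are not comparable at all.

Here is the sequential code. It produces the approved result: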

estimator = CatBoostModel(
    features=features + features_cat,
    target=target,
    num_boost_round=700,
    prediction_name="catboost_prediction",
    prediction_type=prediction_type,
)

for fold in folds:
    logging.info(f"training fold: {fold}")  # 1,2,3
    # Train on every fold except the current one, validate on the current one.
    df_train = df[df.cv_fold != fold]
    df_val = df[df.cv_fold == fold]
    estimator.fit(df=df_train, evals=[df_val], early_stopping_rounds=100, verbose_eval=True)
    cv_scores[cv_fold == fold] = estimator.predict(df_val)

Here is my parallel code:

import concurrent.futures

def train_fold(fold,
               cv_scores,
               df, estimator: CatBoostModel):
    logging.info(f"training fold: {fold}")
    df_train = df[df.cv_fold != fold]
    df_val = df[df.cv_fold == fold]
    estimator.fit(df=df_train, evals=[df_val], early_stopping_rounds=100, verbose_eval=True)
    result = estimator.predict(df_val)

    return (fold, result)

# Note: the same estimator instance is passed to every worker thread.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # The arguments must match the train_fold signature (fold, cv_scores, df, estimator).
    future_to_result = {executor.submit(train_fold, fold, cv_scores, df, estimator): fold for fold in
                        folds}
    for future in concurrent.futures.as_completed(future_to_result):
        res = future_to_result[future]
        (fold, result) = future.result()
        logging.info(f"completed future for {fold}, result: {result.shape}")
        cv_scores[cv_fold == fold] = result
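
One plausible cause of the mismatch is that every thread receives the same estimator object. If the fitted model is stored on that object, concurrent fit calls can overwrite one another before predict runs, so a fold may be scored with a model trained for a different fold. The sketch below illustrates the general remedy of building a fresh estimator inside each worker. It is not taken from vaex: run_folds_in_parallel, make_estimator, fit_and_predict and StandInEstimator are hypothetical names, and the stand-in class only mimics an estimator that keeps its fitted state on itself.

```python
import concurrent.futures


def run_folds_in_parallel(folds, make_estimator, fit_and_predict, max_workers=3):
    """Train one independent estimator per fold; return {fold: predictions}.

    The parameters are hypothetical helpers, not vaex API:
      make_estimator  -- zero-argument callable returning a NEW estimator
      fit_and_predict -- callable (estimator, fold) -> predictions for that fold
    """
    def worker(fold):
        # A fresh estimator per task, so no fitted state is shared between threads.
        estimator = make_estimator()
        return fold, fit_and_predict(estimator, fold)

    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, fold) for fold in folds]
        for future in concurrent.futures.as_completed(futures):
            fold, predictions = future.result()
            # Results are collected in the main thread only.
            results[fold] = predictions
    return results


class StandInEstimator:
    """Toy stand-in that stores its fitted state on itself."""

    def fit(self, fold):
        self.fitted_fold = fold

    def predict(self):
        return self.fitted_fold * 10


def stand_in_fit_and_predict(estimator, fold):
    estimator.fit(fold)
    return estimator.predict()


scores = run_folds_in_parallel([1, 2, 3], StandInEstimator, stand_in_fit_and_predict)
print(scores)
```

Applied to the code in the question, make_estimator would wrap the CatBoostModel(...) constructor call and fit_and_predict would hold the body of train_fold, with cv_scores filled in from the returned dictionary. Whether threads give a real speed-up depends on how much of the training releases the GIL, which is not verified here.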
