I am trying to use dask_ml.xgboost with an eval_set so that training can stop early and avoid overfitting.

Currently, I have a sample dataset, as shown in the example below

from dask.distributed import Client
from dask_ml.datasets import make_classification_df
from dask_ml.xgboost import XGBClassifier


if __name__ == "__main__":
    n_train_rows = 4_000
    n_val_rows = 1_000

    client = Client()
    print(client)

    # Generate balanced data for binary classification
    X_train, y_train = make_classification_df(
        n_samples=n_train_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )
    X_val, y_val = make_classification_df(
        n_samples=n_val_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )

    clf = XGBClassifier(objective="binary:logistic")

    # Train; the eval_set splits are materialized as pandas objects
    # via .compute(), since Dask collections are not accepted there
    clf.fit(
        X_train,
        y_train,
        eval_metric="error",
        eval_set=[
            (X_train.compute(), y_train.compute()),
            (X_val.compute(), y_val.compute()),
        ],
        early_stopping_rounds=5,
    )

    # Make predictions
    y_pred = clf.predict(X_val).compute()
    assert len(y_pred) == len(y_val)

    client.close()

All of X_train, y_train, X_val, and y_val are Dask DataFrames.

I cannot specify eval_set as a nested list of tuples of Dask DataFrames, i.e. eval_set=[(X_train, y_train), (X_val, y_val)]. Instead, they need to be pandas DataFrames, which is why I call .compute() on each of them.

However, when I run the code above (with pandas DataFrames in eval_set), I get this warning

<Client: 'tcp://127.0.0.1:12345' processes=4 threads=12, memory=16.49 GB>
/home/username/.../distributed/worker.py:3373: UserWarning: Large object of size 2.16 MB detected in task graph:
  {'dmatrix_kwargs': {}, 'num_boost_round': 100, 'ev ... ing_rounds': 5}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  warnings.warn(
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL got new rank 0
task NULL got new rank 1
task NULL got new rank 2
task NULL got new rank 3
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.

This code runs to completion and produces predictions. However, the estimator's fit(...) call produces the UserWarning shown above.
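For reference, the pattern the warning text recommends is to scatter large in-memory objects to the workers first and pass the resulting futures instead of the raw data. A minimal sketch of applying that idea here follows; note it is an assumption on my part, not documented dask_ml.xgboost behavior, that fit() will resolve futures inside eval_set:

# Materialize the eval_set splits once as pandas objects
eval_set_local = [
    (X_train.compute(), y_train.compute()),
    (X_val.compute(), y_val.compute()),
]

# Ship them to the workers ahead of time, as the warning suggests;
# scatter() on a list returns one future per element (here, per tuple)
eval_set_futures = client.scatter(eval_set_local, broadcast=True)

# ASSUMPTION: it is not documented that fit() accepts futures inside
# eval_set; this is an experiment based on the warning text only
clf.fit(
    X_train,
    y_train,
    eval_metric="error",
    eval_set=eval_set_futures,
    early_stopping_rounds=5,
)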

Additional notes

  1. In my use case, the row counts of the training and validation splits used in this example reflect their sizes after sampling from the full data (see the sampling sketch after this list). Unfortunately, the full splits needed to train (+ hyperparameter-tune) dask_ml.xgboost are several orders of magnitude larger in row count, based on training and validation learning curves generated with standard XGBoost (from xgboost import XGBClassifier) without dask_ml, following the dask_ml recommendations (1, 2). So I cannot compute() these splits and bring them into memory as pandas DataFrames for distributed XGBoost training.
  2. The number of features used in this example is 50. In the actual use case, I arrived at this number after dropping as many features as I could.
  3. The code is run on a local machine.
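Regarding note 1, here is a minimal sketch of the kind of sampling step meant there, using a hypothetical full_ddf as a stand-in for the (much larger) full Dask DataFrame; note that Dask's DataFrame.sample accepts a fraction rather than a row count:

import dask.dataframe as dd
import pandas as pd

# Hypothetical stand-in for the full dataset, which in the real use
# case is orders of magnitude larger than the splits shown above
full_ddf = dd.from_pandas(
    pd.DataFrame({"feature": range(100_000), "target": [0, 1] * 50_000}),
    npartitions=10,
)

# Dask DataFrames sample by fraction; draw roughly 4% of the rows
sampled_ddf = full_ddf.sample(frac=0.04, random_state=2)

# Only the much smaller sample is ever materialized in memory
sampled_pdf = sampled_ddf.compute()
print(len(sampled_pdf))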

Question

Is there a correct/recommended approach to run dask_ml xgboost with an eval_set composed of Dask DataFrames?

EDIT

Note that the training split is also passed in eval_set (in addition to the validation split), with the intention of generating learning curves from the output of model training (see here).
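For reference, that learning-curve step would look roughly like the sketch below, assuming dask_ml's XGBClassifier populates evals_result() the way the plain xgboost sklearn API does (validation_0 for the first eval_set entry, i.e. the training split, and validation_1 for the second, the validation split):

import matplotlib.pyplot as plt

# evals_result() maps each eval_set entry to its per-round metrics
results = clf.evals_result()
train_error = results["validation_0"]["error"]  # training split
val_error = results["validation_1"]["error"]    # validation split

# One point per boosting round; divergence between the two curves
# signals overfitting
rounds = range(len(train_error))
plt.plot(rounds, train_error, label="train")
plt.plot(rounds, val_error, label="validation")
plt.xlabel("boosting round")
plt.ylabel("error")
plt.legend()
plt.savefig("learning_curves.png")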
