I am trying to use dask_ml.xgboost with an eval_set, so that early stopping can be used to avoid overfitting.

Currently, I have a sample dataset, as shown in the example below
from dask.distributed import Client
from dask_ml.datasets import make_classification_df
from dask_ml.xgboost import XGBClassifier

if __name__ == "__main__":
    n_train_rows = 4_000
    n_val_rows = 1_000

    client = Client()
    print(client)

    # Generate balanced data for binary classification
    X_train, y_train = make_classification_df(
        n_samples=n_train_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )
    X_val, y_val = make_classification_df(
        n_samples=n_val_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )

    clf = XGBClassifier(objective="binary:logistic")

    # train
    clf.fit(
        X_train,
        y_train,
        eval_metric="error",
        eval_set=[
            (X_train.compute(), y_train.compute()),
            (X_val.compute(), y_val.compute()),
        ],
        early_stopping_rounds=5,
    )

    # Make predictions
    y_pred = clf.predict(X_val).compute()
    assert len(y_pred) == len(y_val)

    client.close()
All of X_train, y_train, X_val, and y_val are dask DataFrames.

I cannot specify eval_set as a nested list of dask DataFrames. Instead, the pairs need to be pandas DataFrames, which is why I call .compute() on each of them, i.e. eval_set=[(X_train.compute(), y_train.compute()), (X_val.compute(), y_val.compute())].

However, when I run the code above (with pandas DataFrames in the eval_set), I receive this warning
<Client: 'tcp://127.0.0.1:12345' processes=4 threads=12, memory=16.49 GB>
/home/username/.../distributed/worker.py:3373: UserWarning: Large object of size 2.16 MB detected in task graph:
{'dmatrix_kwargs': {}, 'num_boost_round': 100, 'ev ... ing_rounds': 5}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
warnings.warn(
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL got new rank 0
task NULL got new rank 1
task NULL got new rank 2
task NULL got new rank 3
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
This code runs to completion and generates predictions. However, the estimator.fit(...) line produces this UserWarning.
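For reference, the pattern the warning points to looks like the sketch below. It only restates the client.scatter idiom from the warning text; whether dask_ml.xgboost's fit() would actually resolve scattered futures passed inside eval_set is not something I have confirmed, so treat it as illustrative only (the eval_set contents here are tiny hypothetical stand-ins).

from dask.distributed import Client
import pandas as pd

client = Client()

# Stand-in for the computed eval_set pairs from the example above
# (tiny hypothetical data; the real object is the ~2.16 MB one flagged
# in the warning).
eval_set = [
    (pd.DataFrame({"feat": [0.1, 0.2, 0.3]}), pd.Series([0, 1, 0])),
]

# bad: embedding the large object directly in the task graph
#     future = client.submit(func, eval_set)
# good (per the warning): ship the data to the workers once, then pass
# the lightweight futures around instead of the data itself
eval_set_futures = client.scatter(eval_set)  # one future per (X, y) pair

client.close()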
Additional notes

- In my use-case, the numbers of rows in the training and validation splits used in the example here reflect the sizes after sampling from the overall data. Unfortunately, the overall data splits needed for dask_ml.xgboost training (+ hyperparameter tuning) are a few orders of magnitude larger in row count, based on training and validation learning curves generated, as recommended by dask_ml, with standard XGBoost (using from xgboost import XGBClassifier) and no dask_ml (1, 2). So I cannot compute those splits and bring them into memory as pandas DataFrames for distributed XGBoost training. (A sketch of that single-machine pattern follows these notes.)
- The number of features used in the example here is 50. I arrived at this number (in the real use-case) after dropping as many features as possible.
- The code is run on a local machine.
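For contrast with the distributed case, the in-memory pattern referred to in the first note, using the standard xgboost sklearn API where a pandas/numpy eval_set and early stopping work natively, looks roughly like this (a minimal sketch with synthetic stand-in data; it assumes an xgboost version in which eval_metric and early_stopping_rounds are still fit() arguments, as in the dask_ml example above):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # standard, non-distributed API

# Synthetic stand-in for the sampled splits described above
X, y = make_classification(n_samples=5_000, n_features=50, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=2
)

clf = XGBClassifier(objective="binary:logistic")
# In-memory data, so eval_set works directly, with no .compute() calls
clf.fit(
    X_train,
    y_train,
    eval_metric="error",
    eval_set=[(X_train, y_train), (X_val, y_val)],
    early_stopping_rounds=5,
)
print(clf.best_iteration)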
Question

Is there a correct/recommended approach to running a dask_ml xgboost fit with an eval_set composed of dask DataFrames?
Edit

Note that the training split is also passed in the eval_set (in addition to the validation split), with the intention of using the output of model training to generate learning curves (see here).
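For completeness, the learning curves mentioned here would be read off the per-round metric history recorded during fitting. Assuming the dask_ml estimator mirrors the standard xgboost sklearn API's evals_result() method (an assumption I have not verified for dask_ml.xgboost), the extraction would look roughly like this:

# Hypothetical post-fit inspection; assumes clf is the fitted estimator
# from the example above and that it exposes the standard xgboost
# sklearn-API evals_result() method (unverified for dask_ml.xgboost).
history = clf.evals_result()

# With two pairs in eval_set, the standard key names are "validation_0"
# (the training split here) and "validation_1" (the validation split).
train_error = history["validation_0"]["error"]
val_error = history["validation_1"]["error"]

# One value per boosting round: exactly the data for a learning curve.
for i, (tr, va) in enumerate(zip(train_error, val_error)):
    print(f"round {i}: train error={tr:.4f}, val error={va:.4f}")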