python - Training a CatBoost model on massive data (~22 GB) in multiple chunks

Question

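I am trying to train a CatBoostClassifier on about 22 GB of data in a CSV file with roughly 50 columns. I tried to load all the data at once into a pandas DataFrame but could not. Is there any way to train the model in CatBoost using multiple DataFrame chunks?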

score 3 · Accepted Answer

CatBoost incremental fit for large data files.

As long as you train on CPU and pass init_model as a fit parameter, you can train the model incrementally. Here is an example of how to do it:

from catboost import CatBoostClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

clf = CatBoostClassifier(task_type="CPU",
                         iterations=2000,
                         learning_rate=0.2,
                         max_depth=1)

# Read the CSV lazily, 100,000 rows at a time, instead of loading it all at once
chunks = pd.read_csv('BigDataFile.csv', chunksize=100000)
for i, ds in enumerate(chunks):
    W = ds.values
    X = W[:, :-1].astype(float)   # every column except the last is a feature
    Y = W[:, -1].astype(int)      # the last column is the label
    del W
    if i == 0:
        # Hold out 20% of the first chunk as a validation set reused for every chunk
        X_train, X_val, Y_train, Y_val = train_test_split(X, Y,
                                                          train_size=0.80,
                                                          random_state=1234)
        del X, Y
        clf.fit(X_train, Y_train,
                eval_set=(X_val, Y_val))
    else:
        # Continue training from the model saved after the previous chunk
        clf.fit(X, Y,
                eval_set=(X_val, Y_val),
                init_model='model.cbm')
    clf.save_model('model.cbm')         # save the model so it is loaded in the next step

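You are good to go. This works on CPU only. Do not use a snapshot file or best_model. The model file is loaded at each step, and after the initial step training continues incrementally for as long as you have data.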

score -1 · Accepted Answer

I am not sure, but you could try the save_snapshot and snapshot_file options of the model. Their purpose is to let training continue after an interruption.

from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=50,
                           save_snapshot=True,
                           snapshot_file='model_binary_snapshot.model',
                           random_seed=42)

It saves the model under 'model_binary_snapshot.model', which you can reload to continue training.

model2 = CatBoostClassifier( )
model2.load_model('model_binary_snapshot.model')
