
I want to train LightGBM on my dataset using the GPU in Google Colaboratory (I also selected the Python 3 runtime with GPU). To do this, I used the following code block:

!apt-get -qq install --no-install-recommends nvidia-375
!apt-get -qq install --no-install-recommends nvidia-opencl-icd-375 nvidia-opencl-dev opencl-headers
#!apt-get update
!apt-get install --no-install-recommends git cmake build-essential libboost-dev libboost-system-dev libboost-filesystem-dev ocl-icd-libopencl1 ocl-icd-opencl-dev
!pip install -qq lightgbm --install-option=--gpu 

In the notebook I also selected the GPU as the device:

clf = LGBMClassifier(
        n_estimators=10000,
        learning_rate=0.03,
        num_leaves=30,
        colsample_bytree=.8,
        subsample=.9,
        max_depth=7,
        reg_alpha=.1,
        reg_lambda=.1,
        min_split_gain=.01,
        min_child_weight=2,
        silent=-1,
        verbose=-1,
        device = 'gpu'
        #gpu_platform_id: '0'
        #gpu_device_id: '0'
        )

And got this:

LightGBMError                             Traceback (most recent call last)
<ipython-input-10-936c00d106e3> in <module>()
     50     clf.fit(trn_x, trn_y, 
     51             eval_set= [(trn_x, trn_y), (val_x, val_y)],
---> 52             eval_metric='auc', verbose=100, early_stopping_rounds=100  #30
     53            )
     54 

/usr/local/lib/python3.6/dist-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    673                                         verbose=verbose, feature_name=feature_name,
    674                                         categorical_feature=categorical_feature,
--> 675                                         callbacks=callbacks)
    676         return self
    677 

/usr/local/lib/python3.6/dist-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
    467                               verbose_eval=verbose, feature_name=feature_name,
    468                               categorical_feature=categorical_feature,
--> 469                               callbacks=callbacks)
    470 
    471         if evals_result:

/usr/local/lib/python3.6/dist-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    178     # construct booster
    179     try:
--> 180         booster = Booster(params=params, train_set=train_set)
    181         if is_valid_contain_train:
    182             booster.set_train_data_name(train_data_name)

/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, silent)
   1303                 train_set.construct().handle,
   1304                 c_str(params_str),
-> 1305                 ctypes.byref(self.handle)))
   1306             # save reference to data
   1307             self.train_set = train_set

/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in _safe_call(ret)
     46     """
     47     if ret != 0:
---> 48         raise LightGBMError(_LIB.LGBM_GetLastError())
     49 
     50 

LightGBMError: b'No OpenCL device found'

I also tried the solution from "Install GPU support for LightGBM on Google Colab", but nothing changed.


2 Answers


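I followed the advice in https://github.com/microsoft/LightGBM/issues/586, but it did not solve my problem. It turned out that the LightGBM library did not know the path to libnvidia-opencl.so. In my case, I therefore changed the path for libnvidia-opencl.so.1 to /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1, and then it worked.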

The one-line solution is:

mkdir -p /etc/OpenCL/vendors && echo "/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

Of course, you must make sure the Nvidia driver is installed correctly. For Ubuntu 18.04, you can follow these instructions: https://www.linuxbabe.com/ubuntu/install-nvidia-driver-ubuntu-18-04
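
If you would rather do this from a notebook cell than from the shell, here is a minimal Python sketch of the same idea: write an ICD file that points the OpenCL loader at the NVIDIA library, then list what is registered. The helper names `register_opencl_icd` and `read_icd_entries` are made up for illustration and are not part of LightGBM or any NVIDIA tool. The demo writes into a temporary directory so it runs without root. To apply the real fix you would call `register_opencl_icd` with the default `/etc/OpenCL/vendors` directory as root, and the library path may differ on your system.

```python
import os
import tempfile


def register_opencl_icd(library_path,
                        vendors_dir="/etc/OpenCL/vendors",
                        icd_name="nvidia.icd"):
    """Write an ICD file whose single line is the OpenCL library path."""
    os.makedirs(vendors_dir, exist_ok=True)
    icd_path = os.path.join(vendors_dir, icd_name)
    with open(icd_path, "w") as f:
        f.write(library_path + "\n")
    return icd_path


def read_icd_entries(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd file name: (library entry, entry exists as a file)}.

    An entry may also be a bare library name that the loader resolves
    through the normal library search path; in that case the second
    value is False even though the setup may be fine.
    """
    entries = {}
    if not os.path.isdir(vendors_dir):
        return entries
    for name in sorted(os.listdir(vendors_dir)):
        if name.endswith(".icd"):
            with open(os.path.join(vendors_dir, name)) as f:
                lib = f.read().strip()
            entries[name] = (lib, os.path.exists(lib))
    return entries


if __name__ == "__main__":
    # Demo in a scratch directory; nothing under /etc is touched here.
    scratch = tempfile.mkdtemp()
    register_opencl_icd(
        "/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1",
        vendors_dir=scratch,
    )
    for name, (lib, exists) in read_icd_entries(scratch).items():
        print(name, "->", lib, "(found)" if exists else "(not found)")
```

If the entry is reported as not found, the path in the ICD file most likely does not match where the driver installed the library, which is the situation described above.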

Answered 2019-10-12T06:32:22.973

I got the same error when running different code. I solved it by disabling MIG and rebooting the machine.

sudo nvidia-smi -mig 0
sudo reboot
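
Before rebooting, it may be worth checking whether MIG is actually enabled. Below is a rough Python sketch, not an official recipe. It assumes your `nvidia-smi` supports the `mig.mode.current` query field (check `nvidia-smi --help-query-gpu` on your driver version). The function names are hypothetical. The parsing is kept separate from the subprocess call so it can be tried on a machine without a GPU.

```python
import shutil
import subprocess


def parse_mig_modes(query_output):
    """Split nvidia-smi output into one MIG mode string per GPU."""
    return [line.strip() for line in query_output.splitlines() if line.strip()]


def any_mig_enabled(query_output):
    """True if at least one GPU reports MIG mode 'Enabled'."""
    return any(mode.lower() == "enabled" for mode in parse_mig_modes(query_output))


def query_mig_modes():
    """Return raw nvidia-smi output, or None if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        # Assumed query field; verify with `nvidia-smi --help-query-gpu`.
        ["nvidia-smi", "--query-gpu=mig.mode.current", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    output = query_mig_modes()
    if output is None:
        print("nvidia-smi not available or query failed")
    elif any_mig_enabled(output):
        print("MIG is enabled on at least one GPU")
    else:
        print("MIG is not enabled")
```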
Answered 2022-01-13T12:45:14.393