I am trying to train an object detection model on Google Colab using a GPU and TuriCreate.

According to the TuriCreate repository, to use the GPU during training you have to follow these instructions:

https://github.com/apple/turicreate/blob/main/LinuxGPU.md

However, every time I start training, the shell prints the following line before training begins:

"Using CPU to create model."

My Colab notebook is structured as follows:

Setting up the CUDA environment

!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
!sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
!sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
!sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
!sudo apt-get update

!wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

!sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
!sudo apt-get update

!wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
!sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
!sudo apt-get update

# Install development and runtime libraries (~4GB)
!sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0

# Install TensorRT. Requires that libcudnn8 is installed above.
!sudo apt-get install -y --no-install-recommends libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0

Checking the installation with nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
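
The same information can also be queried in machine-readable form from a Python cell. This is a minimal sketch, not part of the original notebook: the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration, and it assumes `nvidia-smi` is on the PATH (it returns an empty list otherwise).

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, driver_version, memory.total' CSV lines from nvidia-smi."""
    gpus = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 3:
            gpus.append({"name": fields[0], "driver": fields[1], "memory": fields[2]})
    return gpus


def query_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


print(query_gpus())
```

An empty list here would suggest the driver is not reachable at all, which is a different problem from a framework that fails to pick up a visible GPU.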

Installing dependencies

!pip install turicreate
!pip uninstall -y tensorflow
!pip install tensorflow-gpu 

Setting the bash environment variable

!echo export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH >> ~/.bashrc
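
One caveat, offered as a hedged observation rather than a diagnosis: each `!` command in a notebook runs in its own non-interactive shell, so a line appended to `~/.bashrc` is unlikely to reach the Python kernel that runs the training. A small sketch for inspecting what the kernel actually sees (the helper name `library_path_entries` is hypothetical):

```python
import os


def library_path_entries(environ=None):
    """Split LD_LIBRARY_PATH into its directories, skipping empty entries."""
    env = os.environ if environ is None else environ
    return [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]


entries = library_path_entries()
# True only if the CUDA library directory is already visible to this process.
print("/usr/local/cuda/lib64" in entries, entries)
```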

Training

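import turicreate as tc

# train_sf and valid_sf are SFrames prepared earlier (not shown).
tc.config.set_num_gpus(-1)  # -1 asks TuriCreate to use all available GPUs
model = tc.object_detector.create(train_sf)
scores = model.evaluate(valid_sf)
print(scores['mean_average_precision'])
model.export_coreml('model.mlmodel')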

This is the output:

TuriCreate currently only supports using one GPU. Setting 'num_gpus' to 1.
Using 'image' as feature column
Using 'annotations' as annotations column

Using CPU to create model.

Setting 'batch_size' to 32

I can't figure out what I'm missing.

1 Answer

I managed to solve this: the problem was caused by the TensorFlow version preinstalled on the Colab machine.

!pip uninstall -y tensorflow
!pip uninstall -y tensorflow-gpu
!pip install turicreate
!pip install tensorflow==2.4.0
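
After reinstalling, it may be worth confirming which TensorFlow version is active and whether it can see the GPU before starting a long training run. A minimal sketch, assuming TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available); the helper name `tensorflow_gpu_report` is made up for illustration. If TensorFlow had already been imported in the session, the runtime may need a restart before the new version is picked up.

```python
def tensorflow_gpu_report():
    """Return (version, gpu_names); version is None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except Exception:
        # The import can fail for reasons other than a missing package,
        # for example a broken or mismatched installation.
        return None, []
    gpus = tf.config.list_physical_devices("GPU")
    return tf.__version__, [gpu.name for gpu in gpus]


version, gpus = tensorflow_gpu_report()
print("TensorFlow:", version)
print("GPUs visible:", gpus)
```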
Answered 2021-09-02T08:55:56.620