deep-learning - PyTorch mixed precision learning, torch.cuda.amp running slower than normal

Question

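I am trying to run inference with a plain resnet18 model from torchvision.models. The model was trained in FP32 only, without any mixed-precision training. However, I want faster results at inference time, so I enabled torch.cuda.amp.autocast() only while running the test inference case.

The code for both cases is given below.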

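import torch
import torchvision

device = torch.device('cuda')  # assumed; not shown in the original snippet
model = torchvision.models.resnet18()
model = model.to(device) # Pushing to GPU

# Train the model normally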

Without amp:

tensor = torch.rand(1,3,32,32).to(device) # Random tensor for testing
with torch.no_grad():
  model.eval()
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  model(tensor) # warmup
  model(tensor) # warmup
  start.record()
  for i in range(20): # total time over 20 iterations 
    model(tensor)
  end.record()
  torch.cuda.synchronize()
    
  print('execution time in milliseconds: {}'.format(start.elapsed_time(end)/20))

  execution time in milliseconds: 5.264944076538086

With amp:

tensor = torch.rand(1,3,32,32).to(device)
with torch.no_grad():
  model.eval()
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  model(tensor)
  model(tensor)

  start.record()
  with torch.cuda.amp.autocast(): # autocast initialized
    for i in range(20):
      model(tensor)
  end.record()
  torch.cuda.synchronize()
  
  print('execution time in milliseconds: {}'.format(start.elapsed_time(end)/20))

  execution time in milliseconds: 10.619884490966797

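Clearly, the code with autocast() enabled takes twice as long. Even for larger models such as resnet50, the timing difference is roughly the same.

Can someone help me with this? I am running this example on Google Colab; the GPU specs are below.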

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

torch.version.cuda == 10.1
torch.__version__  == 1.8.1+cu101

score 1 · Accepted Answer

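This is most likely because of the GPU you are using, a P100, which has 3584 CUDA cores but 0 tensor cores. The tensor cores are what usually do most of the work in mixed-precision speedups. You may want to take a quick look at the "Hardware Comparison" section of this article.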

If you are stuck with Colab, the only way I can see to get a possible speedup is to be assigned a T4, which does have tensor cores.
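To see which kind of GPU a Colab session was given, one option is to look at its CUDA compute capability. The sketch below assumes the usual rule of thumb that tensor cores first appear at compute capability 7.0 (Volta); the P100 is 6.0 and the T4 is 7.5. The helper name has_tensor_cores is made up for illustration and is not part of PyTorch.

```python
def has_tensor_cores(capability):
    """Return True if a (major, minor) compute capability is 7.0 or newer.

    Assumption: tensor cores are present from Volta (7.0) onward.
    """
    major, _minor = capability
    return major >= 7


# Known compute capabilities: P100 is 6.0, T4 is 7.5.
print("P100:", has_tensor_cores((6, 0)))
print("T4:  ", has_tensor_cores((7, 5)))

# With PyTorch and a CUDA device present, query the actual GPU.
try:
    import torch

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)  # (major, minor) tuple
        print(torch.cuda.get_device_name(0), cap, has_tensor_cores(cap))
except ImportError:
    pass  # PyTorch not installed; only the static examples above run
```

If this reports False for the assigned GPU, autocast is unlikely to give a speedup there.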

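Also, you seem to be using only a single image, i.e. a batch size of 1. If you do get a T4, try rerunning the benchmark with larger batch sizes (e.g. 32, 64, 128, 256). You should see a more noticeable improvement once the work is parallelized across the batch.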
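A rough way to rerun the comparison over several batch sizes is sketched below. It is only an illustration, not a rigorous benchmark: the helper time_per_iter_ms and the chosen batch sizes are my own, and it uses wall-clock timing with torch.cuda.synchronize() rather than CUDA events. Unlike the amp snippet in the question, the warmup calls here run in the same mode (with or without autocast) as the timed calls, so one-time setup cost stays out of the measurement.

```python
import time


def time_per_iter_ms(fn, warmup=5, iters=20, sync=lambda: None):
    """Average wall-clock milliseconds per call of fn, after warmup calls."""
    for _ in range(warmup):
        fn()
    sync()  # make sure warmup work has finished before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - t0) * 1000.0 / iters


def main():
    try:
        import torch
        import torchvision
    except ImportError:
        print("PyTorch/torchvision not installed; skipping GPU benchmark")
        return
    if not torch.cuda.is_available():
        print("No CUDA device; skipping GPU benchmark")
        return

    device = torch.device("cuda")
    model = torchvision.models.resnet18().to(device).eval()

    for batch_size in (1, 32, 128):  # illustrative sizes
        x = torch.rand(batch_size, 3, 32, 32, device=device)

        def run_fp32():
            with torch.no_grad():
                model(x)

        def run_amp():
            with torch.no_grad(), torch.cuda.amp.autocast():
                model(x)

        t_fp32 = time_per_iter_ms(run_fp32, sync=torch.cuda.synchronize)
        t_amp = time_per_iter_ms(run_amp, sync=torch.cuda.synchronize)
        print(f"batch {batch_size}: fp32 {t_fp32:.3f} ms, amp {t_amp:.3f} ms")


if __name__ == "__main__":
    main()
```

On a GPU without tensor cores I would expect the amp column to stay close to, or worse than, the fp32 column at every batch size.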
