python-3.x - Why do I get a runtime error when training a YOLOv5 model?

Question

Traceback (most recent call last):
  File "train.py", line 519, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 300, in train
    scaler.scale(loss).backward()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([8, 64, 80, 80], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_HALF
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7efbbc25f670
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 8, 64, 80, 80, 
    strideA = 409600, 6400, 80, 1, 
output: TensorDescriptor 0x7efbbc27c890
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 8, 64, 80, 80, 
    strideA = 409600, 6400, 80, 1, 
weight: FilterDescriptor 0x7efbbc2200c0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 64, 3, 3, 
Pointer addresses: 
    input: 0x7dc4580000
    output: 0x7c6c720000
    weight: 0x7d4674a000
Additional pointer addresses: 
    grad_output: 0x7c6c720000
    grad_weight: 0x7d4674a000
Backward filter algorithm: 5
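
When reporting an error like this one, it usually helps to state the exact PyTorch, CUDA, and cuDNN versions alongside the dump above. The following is a minimal sketch that collects them, using only standard `torch` attributes. The helper name `collect_env_report` is hypothetical and is not part of YOLOv5 or PyTorch.

```python
import platform


def collect_env_report():
    """Return a dict describing the Python / PyTorch / CUDA / cuDNN setup.

    Hypothetical helper for illustration only.
    """
    report = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this interpreter
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda          # None on CPU-only builds
    report["cudnn"] = torch.backends.cudnn.version()   # None if cuDNN is unavailable
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Device 0 is an assumption; pick the index actually used for training.
        report["gpu"] = torch.cuda.get_device_name(0)
        report["capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running this in the same conda environment as `train.py` and pasting the output next to the traceback gives readers the version details they would otherwise have to ask for.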
