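I am trying to modify the ResNet-18 model in PyTorch so that it invokes a separately trained auxiliary model. The auxiliary model takes the intermediate-layer output at the end of each ResNet block as input and makes some auxiliary predictions during the inference phase.
I want the auxiliary computation that follows one block to run in parallel with the computation of the next ResNet block, so as to reduce the end-to-end latency of the whole pipeline on the GPU.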
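Functionally, I have basic code that works, but the auxiliary model executes serially with the ResNet block computation. I verified this in two ways:
By instrumenting the running time of the original ResNet model (say, time t1) and of the auxiliary model (say, time t2). My execution time is currently t1 + t2.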
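For readers who want to reproduce this kind of measurement, here is a minimal sketch of one way to time the two parts. It is not the instrumentation used above: `timed_forward` and the tiny stand-in networks are hypothetical, and the code falls back to the CPU when no GPU is present. Because CUDA kernels are launched asynchronously, the sketch calls `torch.cuda.synchronize()` before reading the clock.

```python
import time

import torch
import torch.nn as nn


def timed_forward(model: nn.Module, x: torch.Tensor, iters: int = 10) -> float:
    """Return the mean wall-clock time of one forward pass, in seconds."""
    use_cuda = x.is_cuda
    with torch.no_grad():
        model(x)  # warm-up so one-time setup cost is not measured
        if use_cuda:
            torch.cuda.synchronize()  # drain pending kernels before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        return (time.perf_counter() - start) / iters


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-ins for the real networks (hypothetical, for illustration only).
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).to(device).eval()
auxiliary = nn.Sequential(nn.Flatten(), nn.Linear(8 * 16 * 16, 10)).to(device).eval()
combined = nn.Sequential(backbone, auxiliary)

x = torch.randn(1, 3, 16, 16, device=device)
t1 = timed_forward(backbone, x)
t2 = timed_forward(auxiliary, backbone(x).detach())
t_total = timed_forward(combined, x)
print(f"t1={t1:.6f}s  t2={t2:.6f}s  t1+t2={t1 + t2:.6f}s  total={t_total:.6f}s")
```

If `total` stays close to `t1 + t2`, the two parts are not overlapping, which matches the observation above.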
Original ResNet block code (this is BasicBlock, since I am experimenting with ResNet-18). The full code is available here
class BasicBlock(nn.Module):
    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        # Project the input when the block changes its shape
        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)
        return out
    # Modified forward() that also tries to make an auxiliary prediction
    def forward(self, x):
        if len(x[0]) == self.auxiliary_prediction_size:  # Got an auxiliary prediction earlier
            return x

        # Do usual block computation
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)

        # Try to make an auxiliary prediction
        # First flatten the tensor (also assume for now that batch size is 1)
        batchSize = x.shape[0]
        intermediate_output = out.view(batchSize, -1)

        # Place the flattened tensor on the GPU
        device = torch.device("cuda:0")
        input = intermediate_output.to(device)

        # Make auxiliary prediction
        auxiliary_input = out.float()
        auxiliary_prediction = self.auxiliary_model(auxiliary_input)
        # meets_condition is a hypothetical placeholder for "meets some condition"
        if self.meets_condition(auxiliary_prediction):
            return auxiliary_prediction

        # If no auxiliary prediction, then return intermediate output
        return out
        # Variant of the auxiliary part: the prediction is still computed but no longer returned
        batchSize = x.shape[0]
        intermediate_output = out.view(batchSize, -1)

        # Place the flattened tensor on the GPU
        device = torch.device("cuda:0")
        input = intermediate_output.to(device)

        # Make auxiliary prediction
        auxiliary_input = out.float()
        auxiliary_prediction = self.auxiliary_model(auxiliary_input)
        # meets_condition is a hypothetical placeholder for "meets some condition"
        if self.meets_condition(auxiliary_prediction):
            # Comment out return to break data dependency
            # return auxiliary_prediction
            pass

        # If no auxiliary prediction, then return intermediate output
        return out
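However, this did not work. After further research, I came across CUDA streams via a Stack Overflow link. I tried to incorporate the idea of CUDA streams to solve my problem as follows: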
    def forward(self, x):
        if len(x[0]) == self.auxiliary_prediction_size:  # Got an auxiliary prediction earlier
            return x

        s1 = torch.cuda.Stream()
        s2 = torch.cuda.Stream()

        with torch.cuda.Stream(s1):
            # Do usual block computation
            residual = x
            out = self.conv1(x)
            out = self.bn1(out)
            out = self.relu(out)
            out = self.conv2(out)
            out = self.bn2(out)
            if self.downsample is not None:
                residual = self.downsample(x)
            out += residual
            out = self.relu(out)

        with torch.cuda.Stream(s2):
            # Try to make an auxiliary prediction
            # First flatten the tensor (also assume for now that batch size is 1)
            out_detach = out.detach()  # Detach from backprop flow and from computational graph dependency
            batchSize = x.shape[0]
            intermediate_output = out_detach.view(batchSize, -1)

            # Place the flattened tensor on the GPU
            device = torch.device("cuda:0")
            input = intermediate_output.to(device)

            # Make auxiliary prediction
            auxiliary_input = out_detach.float()
            auxiliary_prediction = self.auxiliary_model(auxiliary_input)
            # meets_condition is a hypothetical placeholder for "meets some condition"
            if self.meets_condition(auxiliary_prediction):
                return auxiliary_prediction

        # If no auxiliary prediction, then return intermediate output
        return out
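However, the output of the Nvidia Visual Profiler still shows that all the work is being done on the default stream and is still serialized. Note that I did verify, using a small CUDA program, that the CUDA version I am using supports CUDA streams.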
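Why doesn't breaking the data dependency lead PyTorch to schedule the computations in parallel? I thought this was the point of dynamic computation graphs in PyTorch.
Why doesn't using CUDA streams delegate the computation to non-default streams?
Is there an alternative way to execute the auxiliary model asynchronously/in parallel with the ResNet block computation?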
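One detail that may be relevant to the second question: in the PyTorch API, `torch.cuda.Stream` is the stream class, while the context manager that makes a stream current is the lowercase `torch.cuda.stream(...)`. The following is a minimal sketch of that pattern, not a verified fix for the code above. `run_with_side_stream` and the stand-in modules are hypothetical names, and the code falls back to sequential CPU execution when no GPU is present. Whether the two streams actually overlap depends on kernel sizes and GPU occupancy, and any Python-side check of the auxiliary prediction's value forces a synchronization.

```python
import torch
import torch.nn as nn


def run_with_side_stream(block: nn.Module, next_block: nn.Module,
                         aux: nn.Module, x: torch.Tensor):
    """Enqueue `aux` on a side stream while `next_block` runs on the current stream."""
    out = block(x)
    if not x.is_cuda:
        # CPU fallback: there are no streams, so both parts simply run in sequence.
        return next_block(out), aux(out.detach())

    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()       # the class creates the stream object
    side.wait_stream(main)           # aux must not start before `out` has been computed
    with torch.cuda.stream(side):    # the lowercase context manager makes it current
        aux_out = aux(out.detach())
    out.record_stream(side)          # tell the caching allocator `out` is in use on `side`
    nxt = next_block(out)            # enqueued on the main stream; may overlap with aux
    main.wait_stream(side)           # join the streams before aux_out is consumed
    return nxt, aux_out


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-ins for two ResNet blocks and the auxiliary model (hypothetical).
block = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).to(device).eval()
next_block = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU()).to(device).eval()
aux = nn.Sequential(nn.Flatten(), nn.Linear(8 * 16 * 16, 10)).to(device).eval()

x = torch.randn(1, 3, 16, 16, device=device)
with torch.no_grad():
    nxt, aux_out = run_with_side_stream(block, next_block, aux, x)
print(tuple(nxt.shape), tuple(aux_out.shape))
```

The sketch places the stream handling outside the blocks because an early `return` from inside one block's `forward` cannot overlap with the next block's computation: the caller needs the returned value before it can decide what to run next.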