I am trying to modify a ResNet-18 model in PyTorch so that it invokes a separate auxiliary model, which takes the intermediate activations produced at the end of each ResNet block as input and makes auxiliary predictions during the inference phase.
I want the auxiliary computation for one block to run in parallel with the computation of the next ResNet block, to reduce the end-to-end latency of the whole pipeline on the GPU.
Functionally, I have a basic version that works, but the auxiliary model executes serially with the ResNet block computation. I verified this in two ways:

1. By adding print statements and checking the order of execution.
2. By instrumenting the runtime of the original ResNet model (say t1) and of the auxiliary model (say t2); my end-to-end execution time is currently t1 + t2.
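(As an aside on the instrumentation: CUDA kernels launch asynchronously, so naive host-side timers can misreport GPU time. Below is a minimal sketch, not my exact harness, of CUDA-event-based timing; time_gpu is a hypothetical helper.)

import torch

def time_gpu(fn, *args, warmup=3, iters=10):
    # Time a GPU callable with CUDA events; host clocks alone are
    # misleading because kernel launches return before the work finishes.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

# e.g. t1 = time_gpu(resnet_block, x); t2 = time_gpu(aux_model, aux_in)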
Original ResNet block code (this is BasicBlock, since I am working with ResNet-18). The full code is available here.
class BasicBlock(nn.Module):
    ...
    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
Here is my modification, which works but runs serially:
def forward(self, x):
    if len(x[0]) == self.auxiliary_prediction_size:  # Got an auxiliary prediction earlier
        return x
    # Do the usual block computation
    residual = x
    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    if self.downsample is not None:
        residual = self.downsample(x)
    out += residual
    out = self.relu(out)
    # Try to make an auxiliary prediction.
    # First flatten the tensor (also assume for now that the batch size is 1).
    batchSize = x.shape[0]
    intermediate_output = out.view(batchSize, -1)
    # Place the flattened tensor on the GPU
    device = torch.device("cuda:0")
    auxiliary_input = intermediate_output.float().to(device)
    # Make the auxiliary prediction
    auxiliary_prediction = self.auxiliary_model(auxiliary_input)
    if meets_condition(auxiliary_prediction):  # placeholder for my early-exit test
        return auxiliary_prediction
    # If no auxiliary prediction, return the intermediate output
    return out
Understandably, the code above creates a data dependency between the auxiliary model's execution and the next block, so everything happens serially. The first thing I tried was to check whether breaking this data dependency would reduce latency. I did this by letting the auxiliary model execute but never returning its prediction, even when the early-exit condition is met (note that this breaks the functionality; the experiment was purely to understand the scheduling behavior). Essentially, what I did was:
batchSize = x.shape[0]
intermediate_output = out.view(batchSize, -1)
# Place the flattened tensor on the GPU
device = torch.device("cuda:0")
auxiliary_input = intermediate_output.float().to(device)
# Make the auxiliary prediction
auxiliary_prediction = self.auxiliary_model(auxiliary_input)
if meets_condition(auxiliary_prediction):  # placeholder for my early-exit test
    # Comment out the return to break the data dependency
    # return auxiliary_prediction
    pass
# If no auxiliary prediction, return the intermediate output
return out
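(To convince myself that this serialization is not specific to my model, here is a minimal sketch, assuming a CUDA device, showing that eager-mode PyTorch enqueues kernels in issue order on the current stream, so even two completely independent ops run back-to-back:)

import torch

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
a = x @ x  # independent of b
b = y @ y  # independent of a, but queued behind it on the same stream
end.record()
torch.cuda.synchronize()
print(f"elapsed: {start.elapsed_time(end):.2f} ms")  # ~t1 + t2, not max(t1, t2)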
However, this did not help, and after digging further I stumbled upon CUDA streams via a Stack Overflow link. I tried to incorporate the CUDA streams idea into my solution as follows:
def forward(self, x):
    if len(x[0]) == self.auxiliary_prediction_size:  # Got an auxiliary prediction earlier
        return x
    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    with torch.cuda.Stream(s1):
        # Do the usual block computation
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
    with torch.cuda.Stream(s2):
        # Try to make an auxiliary prediction.
        # First flatten the tensor (also assume for now that the batch size is 1).
        out_detach = out.detach()  # Detach from backprop flow and from the computational graph dependency
        batchSize = x.shape[0]
        intermediate_output = out_detach.view(batchSize, -1)
        # Place the flattened tensor on the GPU
        device = torch.device("cuda:0")
        auxiliary_input = intermediate_output.float().to(device)
        # Make the auxiliary prediction
        auxiliary_prediction = self.auxiliary_model(auxiliary_input)
        if meets_condition(auxiliary_prediction):  # placeholder for my early-exit test
            return auxiliary_prediction
    # If no auxiliary prediction, return the intermediate output
    return out
However, the output of the Nvidia Visual Profiler still indicates that all the work is being done on the default stream and is still serialized. Note that I verified, with a small standalone CUDA program, that the CUDA version I am using supports streams.
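(For comparison, my understanding from the PyTorch docs is that the context manager is torch.cuda.stream() with a lowercase s, which takes an existing Stream object, while torch.cuda.Stream(...) is the stream constructor itself. A minimal sketch of that canonical pattern with explicit cross-stream synchronization; the tensors and ops here are placeholders:)

import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
x = torch.randn(1, 64, 56, 56, device="cuda")

# Side streams must first wait for work already queued on the current stream.
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):  # lowercase stream(): enters the stream context
    a = x.relu()
with torch.cuda.stream(s2):
    b = x.sigmoid()

# The default stream must wait on s1/s2 before consuming their outputs.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
out = a + b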
My questions:

1. Why doesn't breaking the data dependency cause PyTorch to schedule the computations in parallel? I thought that was the whole point of the dynamic computation graph in PyTorch.
2. Why doesn't using CUDA streams delegate the computation to non-default streams?
3. Is there an alternative way to execute the auxiliary model asynchronously / in parallel with the ResNet block computation?