nlp - How do I perform batch inference with a quantized RoBERTa ONNX model?

Question

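I have converted a RoBERTa PyTorch model to ONNX and quantized it. I can get scores from the ONNX model for a single input data point (one sentence at a time). I would like to understand how to run batch predictions with an ONNX Runtime inference session by passing multiple inputs to the session. An example scenario is below.

Model: roberta-quant.onnx, an ONNX quantized version of the RoBERTa PyTorch model

Code used to convert RoBERTa to ONNX: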

torch.onnx.export(model,
                  args=tuple(inputs.values()),                 # model input
                  f=export_model_path,                         # where to save the model
                  opset_version=11,                            # the ONNX opset version to export the model to
                  do_constant_folding=True,                    # whether to execute constant folding for optimization
                  input_names=['input_ids',                    # the model's input names
                               'attention_mask'],
                  output_names=['output_0'],                   # the model's output names
                  dynamic_axes={'input_ids': symbolic_names,   # variable-length axes (symbolic_names is not shown here)
                                'attention_mask': symbolic_names,
                                'output_0': {0: 'batch_size'}})
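
Whether batching works depends on the batch axis (axis 0) of both inputs having been exported as dynamic, which in turn depends on what symbolic_names contains. A minimal sketch for checking this on a loaded session follows. It relies on InferenceSession.get_inputs(), which returns objects with name and shape attributes. The helper names input_signature and has_dynamic_batch are hypothetical, and the check assumes that a dynamic dimension is reported as a string or None rather than an int.

```python
def input_signature(session):
    """Map each model input name to its declared shape."""
    return {arg.name: list(arg.shape) for arg in session.get_inputs()}


def has_dynamic_batch(session):
    """Return True if axis 0 of every input is symbolic (str) or unknown (None).

    Assumption: fixed dimensions are reported as ints, dynamic ones as
    strings (the symbolic name) or None.
    """
    return all(
        len(arg.shape) > 0 and not isinstance(arg.shape[0], int)
        for arg in session.get_inputs()
    )
```

Printing input_signature(session) also shows the exact input names the model expects, which is useful when building the feed dictionary.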

Sample input to the ONNX Runtime inference session:

{
     'input_ids': array([[    0, 510, 35, 21071, ....., 1, 1,  1,  1, 1, 1]]),
     'attention_mask': array([[1, 1, 1, 1, ......., 0, 0, 0, 0, 0, 0]])
}

Running the ONNX model on 400 data samples (sentences) with an ONNX Runtime inference session:

import onnxruntime

session = onnxruntime.InferenceSession("roberta_quantized.onnx", providers=['CPUExecutionProvider'])
for i in range(400):
    ort_inputs = {
        'input_ids': input_ids[i].cpu().reshape(1, max_seq_length).numpy(),  # max_seq_length=128 here
        # key must match the input name used at export time ('attention_mask')
        'attention_mask': attention_masks[i].cpu().reshape(1, max_seq_length).numpy()
    }

    ort_outputs = session.run(None, ort_inputs)

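In the code above, I loop over the 400 sentences one at a time to get the scores in "ort_outputs". Please help me understand how to do batch processing here with the ONNX model, so that I can send the input_ids and attention_masks for multiple sentences at once and get the ort_outputs.

Thanks in advance!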
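
One possible approach, sketched under assumptions rather than confirmed against this particular model: if axis 0 of both inputs was exported as dynamic, the per-sentence loop can be replaced by feeding arrays of shape (batch, max_seq_length) to session.run. The helpers iter_batches and run_batched below are hypothetical names. They assume the inputs are already NumPy-convertible (for example input_ids.cpu().numpy()), that the model expects int64 inputs named 'input_ids' and 'attention_mask', and that the first output holds the scores.

```python
import numpy as np


def iter_batches(input_ids, attention_masks, batch_size=32):
    """Yield feed dicts whose arrays have shape (batch, seq_len)."""
    # Assumption: the exported model expects int64 token ids and masks.
    ids = np.asarray(input_ids, dtype=np.int64)
    masks = np.asarray(attention_masks, dtype=np.int64)
    for start in range(0, len(ids), batch_size):
        stop = start + batch_size
        yield {
            "input_ids": ids[start:stop],
            "attention_mask": masks[start:stop],
        }


def run_batched(session, input_ids, attention_masks, batch_size=32):
    """Call session.run once per batch and concatenate the first output."""
    outputs = [
        session.run(None, feed)[0]
        for feed in iter_batches(input_ids, attention_masks, batch_size)
    ]
    return np.concatenate(outputs, axis=0)
```

With the session from the question, usage would look like scores = run_batched(session, input_ids.cpu().numpy(), attention_masks.cpu().numpy(), batch_size=32). The last batch may be smaller than batch_size, which is fine as long as the batch axis is dynamic.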
