tensorflow - FP16 is not even twice as fast as FP32 in TensorRT

Question

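I used TensorRT to convert a TensorFlow model into TensorRT engines in both FP16 and FP32 mode.

Testing with 10 images, FP16 is not even twice as fast as FP32. I expected at least a 2x speedup. These are the specs of the Titan RTX, which uses the Turing architecture.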

Using Titan RTX
    FP16
    msec: 0.171075
    msec: 0.134830
    msec: 0.129984
    msec: 0.128638
    msec: 0.118196
    msec: 0.123429
    msec: 0.134329
    msec: 0.119506
    msec: 0.117615
    msec: 0.127687


    FP32
    msec: 0.199235
    msec: 0.180985
    msec: 0.153394
    msec: 0.148267
    msec: 0.151481
    msec: 0.169578
    msec: 0.159987
    msec: 0.173443
    msec: 0.159301
    msec: 0.155503

EDIT_1: Following @y.selivonchyk's reply, I also tested on a Tesla T4. There, FP16 is not faster than FP32 at all.

Using Tesla T4
    FP16
    msec: 0.169800
    msec: 0.136175
    msec: 0.127025
    msec: 0.130406
    msec: 0.129874
    msec: 0.122248
    msec: 0.128244
    msec: 0.126983
    msec: 0.131111
    msec: 0.138897


    FP32
    msec: 0.168589
    msec: 0.130539
    msec: 0.122617
    msec: 0.120955
    msec: 0.128452
    msec: 0.122426
    msec: 0.125560
    msec: 0.130016
    msec: 0.126965
    msec: 0.121818

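Is this result acceptable, or is there something else I need to look into?

On page 15 of this document, there is a 5x difference in images/sec between FP32 and FP16.

My code for serializing the engine from the UFF model and for running inference is shown below.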

import numpy as np
import pycuda.autoinit  # creates the CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

# Logger set up as in the TensorRT samples.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
# height, width and engine_path are assumed to be defined elsewhere in the script.

def serializeandsave_engine(model_file):
    # For more information on TRT basics, refer to the introductory samples.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        builder.max_batch_size = 1  # max_batch_size
        builder.max_workspace_size = 1 << 30
        builder.fp16_mode = True
        builder.strict_type_constraints = True
        # Parse the Uff Network
        parser.register_input("image", (3, height, width))  # UffInputOrder.NCHW
        parser.register_output("Openpose/concat_stage7")  # check input/output names against the TF model
        parser.parse(model_file, network)
        # Build and save the engine (serialize once and reuse the result).
        engine = builder.build_cuda_engine(network)
        serialized_engine = engine.serialize()
        with open(engine_path, 'wb') as f:
            f.write(serialized_engine)

def infer(engine, x, batch_size, context):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    # img = np.array(x).ravel()
    np.copyto(inputs[0].host, x.flatten())  # 1.0 - img / 255.0
    # Transfer input data to the GPU.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return the host-side output buffers.
    return [out.host for out in outputs]

score 3 · Accepted Answer

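Titan-series cards have always been just beefier versions of the consumer graphics cards, with more cores. Titans never had dedicated FP16 cores that would let them run faster with half-precision training. (Luckily, unlike the 1080s, they do not run slower with FP16.)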

This assumption is confirmed by the following two reviews, pugetsystems and tomshardware, where the Titan RTX shows a moderate improvement of about 20% when using half-precision floats.
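As a quick check against the timings posted in the question, the ratio of mean FP32 latency to mean FP16 latency can be computed directly. This is only a sketch: the `speedup` helper is a name made up here, and dropping the first sample as warm-up is an assumption, since the question does not say how the runs were timed.

```python
from statistics import mean

# Latencies copied from the question (reported there as "msec").
titan_rtx = {
    "FP16": [0.171075, 0.134830, 0.129984, 0.128638, 0.118196,
             0.123429, 0.134329, 0.119506, 0.117615, 0.127687],
    "FP32": [0.199235, 0.180985, 0.153394, 0.148267, 0.151481,
             0.169578, 0.159987, 0.173443, 0.159301, 0.155503],
}
tesla_t4 = {
    "FP16": [0.169800, 0.136175, 0.127025, 0.130406, 0.129874,
             0.122248, 0.128244, 0.126983, 0.131111, 0.138897],
    "FP32": [0.168589, 0.130539, 0.122617, 0.120955, 0.128452,
             0.122426, 0.125560, 0.130016, 0.126965, 0.121818],
}


def speedup(timings, skip=1):
    """Return mean FP32 latency divided by mean FP16 latency.

    The first `skip` samples are dropped as warm-up. That is an
    assumption; the question does not say whether warm-up was excluded.
    """
    fp16 = mean(timings["FP16"][skip:])
    fp32 = mean(timings["FP32"][skip:])
    return fp32 / fp16


print(f"Titan RTX: {speedup(titan_rtx):.2f}x")
print(f"Tesla T4:  {speedup(tesla_t4):.2f}x")
```

On these ten samples the Titan RTX comes out at roughly 1.3x, which is in the same ballpark as the modest gain the reviews report, while the Tesla T4 lands slightly below 1x. With so few samples, neither figure should be read as precise.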

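In short, FP16 is only faster when there is a dedicated hardware block for it on the chip, which is usually not the case for the Titan line. FP16 does, however, still reduce memory consumption during training and lets you run larger models.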
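The memory point is simple arithmetic: a half-precision value takes 2 bytes and a single-precision value takes 4. The sketch below shows this for weight storage only; the parameter count is a hypothetical figure chosen for illustration, not the size of the model in the question.

```python
import struct

# Standard sizes of IEEE 754 single- and half-precision floats.
BYTES_FP32 = struct.calcsize("<f")  # 4 bytes
BYTES_FP16 = struct.calcsize("<e")  # 2 bytes


def weights_mib(num_params, bytes_per_value):
    """Return the storage needed for the weights alone, in MiB."""
    return num_params * bytes_per_value / 2**20


num_params = 50_000_000  # hypothetical model size, for illustration only

print(f"FP32 weights: {weights_mib(num_params, BYTES_FP32):.1f} MiB")
print(f"FP16 weights: {weights_mib(num_params, BYTES_FP16):.1f} MiB")
```

Activations and workspace memory are not counted here, so actual savings depend on the model and the framework.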

score 0 · Accepted Answer

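The conversion depends heavily on both the hardware and the model, so latency is not always cut in half in FP16 precision mode. You may see no significant change on one piece of hardware and a dramatic change on another. I would also suggest using more images, or passing the same images through the model more times via batching, because doing some warm-up runs is always a good idea. So feed the model at least 200-300 images, with 50-100 warm-up runs, to get better and more realistic results.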
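A minimal timing harness along those lines might look like the sketch below. The `benchmark` function and its `run_inference` argument are hypothetical names: `run_inference` stands for any callable that performs one synchronous inference (for a GPU, one that returns only after the stream has been synchronized). The defaults of 50 warm-up calls and 200 timed calls follow the ranges suggested above.

```python
import time
from statistics import mean, median


def benchmark(run_inference, inputs, warmup=50, min_runs=200):
    """Time run_inference over inputs after untimed warm-up calls.

    run_inference is a hypothetical callable that performs one
    synchronous inference on a single input.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")

    # Warm-up: results and timings are discarded.
    for i in range(warmup):
        run_inference(inputs[i % len(inputs)])

    # Timed runs: cycle through the inputs until min_runs is reached.
    latencies_ms = []
    for i in range(min_runs):
        start = time.perf_counter()
        run_inference(inputs[i % len(inputs)])
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": len(latencies_ms),
        "mean_ms": mean(latencies_ms),
        "median_ms": median(latencies_ms),
    }


# Stand-in workload so the sketch runs without a GPU.
fake_images = [list(range(1000)) for _ in range(10)]
stats = benchmark(sum, fake_images)
print(stats)
```

Reporting the median alongside the mean makes a few slow outlier runs easier to spot.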
