python - Wav2Lip usage and performance

Question

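Actual question:

Currently OpenCV is used to write the video frames into a single file. Can the audio be attached directly, or is there another way to create small video segments that can be broadcast over the RTP protocol or streamed directly from the Python code?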

out = cv2.VideoWriter(
        'temp/result.avi', 
        cv2.VideoWriter_fourcc(*'DIVX'), 
        fps, 
        (frame_w, frame_h))

... #some frame manipulation happening

out.write(f) # f = video frame

I do not want to write the video file first and then combine it with the audio using ffmpeg.
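
One direction that might avoid the intermediate file is to pipe the raw frames into an ffmpeg subprocess through stdin and let ffmpeg mux them with the audio and send the result out. The following is only a sketch under assumptions and has not been tried with Wav2Lip: the helper names `build_ffmpeg_cmd` and `stream_frames`, the destination URL, and the choice of MPEG-TS over UDP with libx264 and AAC are all illustrative, not part of the original code.

```python
import subprocess


def build_ffmpeg_cmd(width, height, fps, audio_path, url):
    """Build an ffmpeg command that reads raw BGR frames from stdin,
    muxes them with an audio file and streams MPEG-TS to `url`.
    (Hypothetical helper; codec and container choices are assumptions.)"""
    return [
        "ffmpeg", "-y",
        # Input 0: raw frames in OpenCV's BGR layout, read from stdin.
        "-f", "rawvideo", "-pix_fmt", "bgr24",
        "-s", f"{width}x{height}", "-r", str(fps), "-i", "-",
        # Input 1: the audio source (may itself be a video container).
        "-i", audio_path,
        # Take video from input 0 and audio from input 1.
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "libx264", "-preset", "ultrafast", "-tune", "zerolatency",
        "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        # Stop when the shorter of the two inputs ends.
        "-shortest",
        "-f", "mpegts", url,
    ]


def stream_frames(frames, width, height, fps, audio_path, url):
    """Write each frame (a uint8 NumPy array of shape HxWx3) to ffmpeg's stdin."""
    cmd = build_ffmpeg_cmd(width, height, fps, audio_path, url)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    try:
        for f in frames:
            proc.stdin.write(f.tobytes())
    finally:
        proc.stdin.close()
        proc.wait()
```

With something like this, the `out.write(f)` call would be replaced by a write to the pipe, so frames can leave the process as soon as each batch is ready. Whether the receiving side copes with the bursty arrival of frames is something I would still have to verify.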

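Background:

I am trying to write an application that needs real-time lip-syncing. For this I am experimenting with Wav2Lip. At first the library seemed very slow, but with some optimizations it can actually be quite fast.

Experiments:

I first lip-synced one video manually with the audio of another video file, using the following command: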

python inference.py --checkpoint_path ptmodels\wav2lip.pth --face testdata\face.mp4 --audio testdata\audio.mp4

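face.mp4 is 25 seconds long, 30 fps, with a resolution of 854x480. audio.mp4 is 260 seconds long, 30 fps, with a resolution of 480x360.

The total generation time was exactly 109 seconds. After dissecting and profiling the code, I found that two parts take the longest:

The face detection part took 48.64 seconds
The lip-syncing part took 48.50 seconds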

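I then tried it with a static image instead of a video, which cut the time significantly (in my use case I will later only ever use the same face, so I will opt to precompute the face detection at startup).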
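
Precomputing the detection at startup could look roughly like the sketch below. `CachedFaceBox` and `fake_detect` are hypothetical names used only for illustration; a real detector would run the face detection model, and the `(y1, y2, x1, x2)` ordering simply mirrors the `coords` unpacking in the inference loop.

```python
class CachedFaceBox:
    """Run an expensive face detector once and reuse the result afterwards."""

    def __init__(self, detect_fn):
        # detect_fn is a hypothetical callable: image -> (y1, y2, x1, x2)
        self._detect_fn = detect_fn
        self._box = None

    def get(self, image):
        # Only the first call pays the detection cost.
        if self._box is None:
            self._box = self._detect_fn(image)
        return self._box


# Stand-in detector for illustration only.
calls = []

def fake_detect(image):
    calls.append(1)
    return (10, 110, 20, 120)

cache = CachedFaceBox(fake_detect)
first = cache.get("face.jpg")   # runs the detector
second = cache.get("face.jpg")  # served from the cache
```

This only works as long as the face image really never changes, which is the assumption in my use case.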

python inference.py --checkpoint_path ptmodels\wav2lip.pth --face testdata\face.jpg --audio testdata\audio.mp4

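The face detection part took 1.01 seconds
The lip-syncing part took 48.50 seconds

After looking into the lip-syncing part, I saw that the entire lip-synced video is generated first and only then combined with the audio: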

for i, (img_batch, mel_batch, frames, coords) in enumerate(tqdm(gen, 
                                        total=int(np.ceil(float(len(mel_chunks))/batch_size)))):
    if i == 0:
        model = load_model(args.checkpoint_path)
        print ("Model loaded")

        frame_h, frame_w = full_frames[0].shape[:-1]
        out = cv2.VideoWriter('temp/result.avi', 
                                cv2.VideoWriter_fourcc(*'DIVX'), fps, (frame_w, frame_h))

    img_batch = torch.FloatTensor(np.transpose(img_batch, (0, 3, 1, 2))).to(device)
    mel_batch = torch.FloatTensor(np.transpose(mel_batch, (0, 3, 1, 2))).to(device)

    with torch.no_grad():
        pred = model(mel_batch, img_batch)

    pred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.
    
    for p, f, c in zip(pred, frames, coords):
        y1, y2, x1, x2 = c
        p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))

        f[y1:y2, x1:x2] = p
        out.write(f)

out.release()
command = 'ffmpeg -y -i {} -i {} -strict -2 -q:v 1 {}'.format(args.audio, 'temp/result.avi', args.outfile)
subprocess.call(command, shell=platform.system() != 'Windows')

I then decided to profile each lip-syncing cycle, with the following results:

lypsinc frames generated for batch 0 containing 128 frames with 30.0fps (video part length: 4.26s), took:  3.51s
lypsinc frames generated for batch 1 containing 128 frames with 30.0fps (video part length: 4.26s), took:  0.73s
lypsinc frames generated for batch 2 containing 128 frames with 30.0fps (video part length: 4.26s), took:  0.76s

...

lypsinc frames generated for batch 53 containing 128 frames with 30.0fps (video part length: 4.26s), took:  0.73s
lypsinc frames generated for batch 54 containing 17 frames with 30.0fps (video part length: 0.56s), took:  0.89s

all lypsinc frames generated, took:  48.50s
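
Per-batch timings like the ones above can be collected with a small wrapper around the loop body. This is a minimal sketch, not the exact code I used: `time_batches` and `process` are hypothetical names, and `process` stands in for the model call plus the frame pasting.

```python
import time


def time_batches(batches, process, fps):
    """Call process(batch) for every batch and record
    (frame count, video length in seconds, elapsed seconds)."""
    stats = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        process(batch)
        elapsed = time.perf_counter() - start
        n = len(batch)
        stats.append((n, n / fps, elapsed))
        print(f"batch {i}: {n} frames at {fps}fps "
              f"(video part length: {n / fps:.2f}s), took: {elapsed:.2f}s")
    return stats
```

On a GPU the measured time is only meaningful if the work has actually finished when the timer stops; in the loop above the `pred.cpu()` call forces that, otherwise an explicit synchronization would be needed.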

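Conclusion: with face detection solved (or rather, put on hold), lip-syncing takes roughly 5 seconds before the first batch of video frames is ready. Each lip-synced video batch is 4.26 seconds long and takes about 0.8 seconds to compute. This means that if these video batches were streamed together with the audio frames, it should be possible to start rendering after a delay of about 5 seconds instead of the roughly 50 seconds in this use case.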
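
The arithmetic behind this estimate can be written out as follows. The inputs are the measurements above; counting the 1.01 s static-image face detection as setup time is my own reading of where the "roughly 5 seconds" comes from, and `streaming_estimate` is just an illustrative helper.

```python
def streaming_estimate(batch_frames, fps, first_batch_s, steady_batch_s, setup_s):
    """Return (video seconds per batch, startup delay, real-time factor)."""
    batch_video_s = batch_frames / fps
    # Playback can start once setup and the first batch are done.
    startup_delay_s = setup_s + first_batch_s
    # How many seconds of video are produced per second of compute.
    realtime_factor = batch_video_s / steady_batch_s
    return batch_video_s, startup_delay_s, realtime_factor


batch_video_s, startup_delay_s, realtime_factor = streaming_estimate(
    batch_frames=128, fps=30.0,
    first_batch_s=3.51, steady_batch_s=0.8, setup_s=1.01,
)
print(f"{batch_video_s:.2f}s of video per batch, "
      f"start after {startup_delay_s:.2f}s, "
      f"{realtime_factor:.1f}x faster than real time")
```

As long as the real-time factor stays above 1, each batch is finished well before the previous one has played out, so the stream should not stall after the initial delay.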
