python - Quickly reading PID values from an MPEG transport stream (binary file) in Python

Question

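I have a large MPEG (.ts) binary file, usually a multiple of 188 bytes in size. I am using Python 3, and when I read 188 bytes at a time and parse them to get the required values, it is really slow. I have to walk through every 188-byte packet to get the PID value (binary data).

Meanwhile, any offline professional MPEG analyzer I try produces the list of all PID values and their total counts within 45 seconds for a TS file that is 5 minutes long, whereas my program takes more than 10 minutes to get the same result.
I don't understand how they can find them so fast, even if they may be written in C or C++.
I tried Python multiprocessing, but it did not help much. That suggests my approach to parsing and processing the 188-byte data is wrong and is causing the huge delay.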

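```python
from bitstring import BitStream  # assumed: BitStream comes from the bitstring package

cnt = 0
with open(file2, 'rb') as f:
    while True:
        data = f.read(188)
        if len(data) == 0:
            break
        b = BitStream(data)
        ...  # parse b to get the required value
        ...  # and increase count when needed
        ...
        cnt = cnt + 188
        f.seek(cnt)
```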

score 1 · Accepted Answer

It's your code, man.

I tried Bitstream for a while too, and it is slow.
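A minimal sketch of the shift-and-mask alternative, assuming each packet is already a 188-byte `bytes` object: the fields of the 4-byte TS header can be read with plain integer operations, so no bit-level object has to be built per packet. The function name `parse_header` and the dictionary keys are illustrative assumptions, not something from the question.

```python
SYNC_BYTE = 0x47  # every MPEG-TS packet starts with this byte


def parse_header(pkt: bytes) -> dict:
    """Decode the 4-byte TS packet header with plain integer operations."""
    if len(pkt) < 4 or pkt[0] != SYNC_BYTE:
        raise ValueError("not a valid TS packet header")
    return {
        # Top three bits of byte 1 are flags; the low five bits start the PID.
        "transport_error": bool(pkt[1] & 0x80),
        "payload_unit_start": bool(pkt[1] & 0x40),
        # The PID is 13 bits: low 5 bits of byte 1, then all of byte 2.
        "pid": ((pkt[1] & 0x1F) << 8) | pkt[2],
        # The continuity counter is the low nibble of byte 3.
        "continuity_counter": pkt[3] & 0x0F,
    }
```

For example, `parse_header(bytes([0x47, 0x44, 0x39, 0x1A]))["pid"]` evaluates to 1081, since `0x04 << 8 | 0x39` is 1081.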

The cProfile module is your friend.
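As a rough sketch of how that might look, the snippet below runs a packet loop under `cProfile` and formats the most expensive calls with `pstats`. The functions `count_packets` and `profile_parse`, and the in-memory sample stream, are hypothetical stand-ins for your own parsing loop and file.

```python
import cProfile
import io
import pstats

PACKET_SIZE = 188


def count_packets(stream) -> int:
    """Hypothetical stand-in for the parsing loop you want to profile."""
    total = 0
    while True:
        pkt = stream.read(PACKET_SIZE)
        if not pkt:
            break
        total += 1
    return total


def profile_parse(stream, top: int = 10) -> str:
    """Run count_packets under cProfile and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    count_packets(stream)
    profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by cumulative time so the slowest call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return report.getvalue()


if __name__ == "__main__":
    # 1000 synthetic packets as sample data; use open(path, "rb") for a real file.
    packet = bytes([0x47, 0x1F, 0xFF, 0x10]) + bytes(184)
    print(profile_parse(io.BytesIO(packet * 1000)))
```

A whole script can also be profiled without code changes by running `python -m cProfile -s cumulative yourscript.py`, where `yourscript.py` is a placeholder for your own file.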

With pypy3, I can parse 3.7 GB of mpegts in 2.9 seconds, single process.

With Go, I can parse 3.7 GB in 1.2 seconds.

score 1 · Accepted Answer

You're a cool dude. Try it like this:

```python
import sys
from functools import partial

PACKET_SIZE = 188  # fixed MPEG-TS packet length in bytes


def do():
    args = sys.argv[1:]
    for arg in args:
        print(f'next file: {arg}')
        pkt_num = 0
        with open(arg, 'rb') as vid:
            # iter() with a b"" sentinel calls vid.read(PACKET_SIZE) until EOF
            for pkt in iter(partial(vid.read, PACKET_SIZE), b""):
                pkt_num += 1
                # the PID is the low 13 bits of header bytes 1 and 2
                pid = (pkt[1] << 8 | pkt[2]) & 0x1FFF
                print(f'Packet: {pkt_num} Pid: {pid}', end='\r')


if __name__ == "__main__":
    do()
```

Keep in mind that printing every PID will slow things down; there are 20 million packets in 3.7 GB of mpegts.

```text
a@fumatica:~/threefive$ time pypy3 cli2.py plp0.ts
next file: plp0.ts
Packet: 20859290 Pid: 1081
real    1m22.976s
user    0m48.331s
sys     0m34.323s
```

Printing every PID takes 1m22.976s.

If I comment out the print call

```python
# print(f'Packet: {pkt_num} Pid: {pid}', end='\r')
```

it runs much faster:

```text
a@fumatica:~/threefive$ time pypy3 no_print.py plp0.ts
next file: plp0.ts

real    0m3.080s
user    0m2.237s
sys     0m0.816s
```

If I change the print call to

```python
print(f'Packet: {pkt_num} Pid: {pid}')
```

and redirect the output to a file, parsing 3.7 GB takes only 9 seconds:

```text
a@fumatica:~/threefive$ time pypy3 cli2.py plp0.ts > out.pids

real    0m9.228s
user    0m7.820s
sys     0m1.229s

a@fumatica:~/threefive$ wc -l out.pids
20859291 out.pids
```

Hope that helps.
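Since the question asks for every PID together with its total count, here is a hedged sketch that tallies PIDs in a `collections.Counter` and prints once per PID at the end rather than once per packet. It also reads many packets per `read()` call, which may reduce per-call overhead; no timings are claimed for it. The names `count_pids`, `read_chunks`, and `CHUNK_PACKETS` are assumptions made for illustration, and the sketch assumes the file starts on a packet boundary with no sync loss.

```python
import sys
from collections import Counter
from functools import partial

PACKET_SIZE = 188
CHUNK_PACKETS = 4096  # packets per read call; an arbitrary choice, tune as needed


def count_pids(chunks) -> Counter:
    """Tally PIDs from an iterable of byte chunks holding whole 188-byte packets."""
    counts = Counter()
    for chunk in chunks:
        # Step through the chunk one packet at a time; a short tail is ignored.
        for off in range(0, len(chunk) - PACKET_SIZE + 1, PACKET_SIZE):
            pid = (chunk[off + 1] << 8 | chunk[off + 2]) & 0x1FFF
            counts[pid] += 1
    return counts


def read_chunks(path):
    """Yield large reads from a file so fewer read() calls are made."""
    with open(path, "rb") as vid:
        yield from iter(partial(vid.read, PACKET_SIZE * CHUNK_PACKETS), b"")


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        totals = count_pids(read_chunks(arg))
        # One line per distinct PID instead of one line per packet.
        for pid, total in sorted(totals.items()):
            print(f"Pid: {pid} Count: {total}")
```

Run it the same way as the script above, for example `pypy3 count_pids.py plp0.ts`, where `count_pids.py` is a hypothetical file name for this sketch.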
