python - Parsing a .srt file with regex

Question

I am writing a small Python script, but since I am a beginner I got stuck: I need to get the time and the text from a .srt file. For example, from

1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org

I need to get:

00:00:01,000 --> 00:00:04,074

and

Subtitles downloaded from www.OpenSubtitles.org.

I have already managed to write the regex for the timing, but I am stuck on the text. I tried a look-behind that reuses my timing regex:

( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+

But it had no effect. Personally, I think a look-behind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.

score 14 · Accepted Answer

Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like this:

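an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line

... and repeat. Note the "one or more lines" part: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.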

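So, just take advantage of the structure. That way you can parse everything in one pass, without having to put more than one line into memory at a time, while still keeping all the information for each subtitle together.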

from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]

For example, using the example on the SRT doc page, I get:

res
Out[60]: 
[['1\n',
  '00:02:17,440 --> 00:02:20,375\n',
  "Senator, we're making\n",
  'our final approach into Coruscant.\n'],
 ['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]

I could further transform that into a list of meaningful objects:

from collections import namedtuple

Subtitle = namedtuple('Subtitle', 'number start end content')

subs = []

for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

subs
Out[65]: 
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
 Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
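
As a side note, the list comprehension above still collects every block into `res` before parsing. If only one subtitle should be held in memory at a time, the same `groupby` idea can be wrapped in a generator. This is a minimal sketch, not part of the original answer: the name `iter_subtitles` and the extra ` --> ` check are assumptions made for illustration.

```python
from collections import namedtuple
from itertools import groupby

Subtitle = namedtuple('Subtitle', 'number start end content')

def iter_subtitles(lines):
    """Yield one Subtitle per blank-line-delimited block of `lines`."""
    # groupby is lazy, so only the current block is materialized
    for has_text, group in groupby(lines, lambda x: bool(x.strip())):
        if not has_text:
            continue  # skip the runs of blank lines between blocks
        block = [x.strip() for x in group]
        if len(block) < 3 or ' --> ' not in block[1]:
            continue  # not a complete subtitle block
        number, start_end, *content = block  # py3 syntax
        start, end = start_end.split(' --> ')
        yield Subtitle(number, start, end, content)

# In-memory stand-in for an open file, using the sample shown above
sample = (
    "1\n00:02:17,440 --> 00:02:20,375\n"
    "Senator, we're making\nour final approach into Coruscant.\n"
    "\n"
    "2\n00:02:20,476 --> 00:02:22,501\nVery good, Lieutenant.\n"
)
for sub in iter_subtitles(sample.splitlines(keepends=True)):
    print(sub.start, sub.end, ' / '.join(sub.content))
```

With a real file, the open file object can be passed directly, e.g. `for sub in iter_subtitles(f)` inside the `with open(filename) as f:` block.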

score 2 · Accepted Answer

Disagree with @roippi. Regex is a perfectly good solution for text matching, and the regex for this problem isn't tricky.

import re

# Read the whole file content into memory
with open(yoursrtfile) as f:
    content = f.read()
# Find all results in content.
# The first big (...) group retrieves the timing, \s+ consumes the
# whitespace (including the line break) after the timing,
# and (.+) retrieves the text content on the line after that.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
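
One thing worth checking with this pattern: `.` does not match a line break by default, so `(.+)` stops at the end of the first text line. The sketch below runs the same pattern on an in-memory sample (the two cues quoted in the first answer, used here as an assumed stand-in for a real file) to show what is captured.

```python
import re

# Assumed sample: the two cues quoted in the first answer
sample = (
    "1\n00:02:17,440 --> 00:02:20,375\n"
    "Senator, we're making\nour final approach into Coruscant.\n"
    "\n"
    "2\n00:02:20,476 --> 00:02:22,501\nVery good, Lieutenant.\n"
)

pattern = r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)"
result = re.findall(pattern, sample)

for timing, text in result:
    print(timing, '|', text)
# 00:02:17,440 --> 00:02:20,375 | Senator, we're making
# 00:02:20,476 --> 00:02:22,501 | Very good, Lieutenant.
```

The second line of the first cue ("our final approach into Coruscant.") is not captured, so this approach fits files whose cues are single-line.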

score 1 · Accepted Answer

Number: ^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
String: *[a-zA-Z]+*

Hope this helps.
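
For what it's worth, here is a rough sketch of how the number and time patterns could be applied line by line in Python. The function name `classify` and the labels it returns are made up for this example. The string pattern is not used: as written, `*[a-zA-Z]+*` is not a valid Python regex (the leading `*` has nothing to repeat), and a letters-only class would miss punctuation and non-Latin text, so the sketch simply treats every other non-empty line as text.

```python
import re

NUMBER_RE = re.compile(r"[0-9]+")
TIME_RE = re.compile(
    r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> "
    r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}"
)

def classify(line):
    """Return 'blank', 'time', 'number' or 'text' for one line of an .srt file."""
    line = line.strip()
    if not line:
        return 'blank'
    # fullmatch plays the role of the ^...$ anchors in the patterns above
    if TIME_RE.fullmatch(line):
        return 'time'
    if NUMBER_RE.fullmatch(line):
        return 'number'
    return 'text'

for line in ["1\n",
             "00:00:01,000 --> 00:00:04,074\n",
             "Subtitles downloaded from www.OpenSubtitles.org\n",
             "\n"]:
    print(classify(line), repr(line))
```

A limitation of classifying lines in isolation: a subtitle line that consists only of digits is reported as 'number', so the position within the block still has to be taken into account.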

score 1 · Accepted Answer

Thanks @roippi for this excellent parser. It helped me a lot to write an srt to stl converter in fewer than 40 lines (in python2 though, as it has to fit into a larger project).

from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple

# prepare  - adapt to your needs or use sys.argv
inputname = 'FR.srt'  
outputname = 'FR.stl'
stlheader = """
$FontName           = Arial
$FontSize           = 34
$HorzAlign          = Center
$VertAlign          = Bottom

"""
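def converttime(sttime):
    "convert from srt time format (ms, 0...999) to stl one (frames at 25 fps, 0...24)"
    st = sttime.split(',')
    # clamp to 24 so that e.g. 990 ms does not round up to an invalid frame 25
    return "%s:%02d"%(st[0], min(24, round(25*float(st[1])  /1000)))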

# load
with open(inputname,'r') as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]   # py 2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname,'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n"%(converttime(sub.start), converttime(sub.end), "|".join(sub.content)) )
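
The only arithmetic in the converter is the time conversion: the millisecond part of the srt time (0 to 999) is scaled to a frame count at 25 frames per second. Here is a standalone sketch of that step for checking values by hand. The name `srt_to_stl_time` and the `fps` parameter are made up for this example, and it floors instead of rounding (a choice of this sketch, not of the code above) so that the frame field always stays below `fps`.

```python
def srt_to_stl_time(srt_time, fps=25):
    """Convert 'HH:MM:SS,mmm' to 'HH:MM:SS:FF', FF being a frame in 0..fps-1."""
    hms, millis = srt_time.split(',')
    # integer floor division: 999 ms at 25 fps gives frame 24, never 25
    frames = int(millis) * fps // 1000
    return "%s:%02d" % (hms, frames)

print(srt_to_stl_time('00:02:17,440'))  # 00:02:17:11  (440 * 25 // 1000 = 11)
print(srt_to_stl_time('00:02:20,990'))  # 00:02:20:24
print(srt_to_stl_time('00:00:01,000'))  # 00:00:01:00
```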

score 0 · Accepted Answer

None of the pure regex solutions above work on real-life srt files.

Let's take a look at the following SRT-formatted text:

1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line

2
00:02:20,476 --> 00:02:22,501
as well as a single line

3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは

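Note that:

The text may contain Unicode characters.
The text can consist of several lines.
Each cue starts with an integer value and ends with a blank new line; both Unix-style and Windows-style CR/LF line endings are accepted.

Here is the working regex: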

\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))

https://regex101.com/r/qICmEM/1
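
For use from Python, the same idea can be sketched as below. This is an adapted variant under stated assumptions, not the exact pattern above: since `[\r\n]` matches a single character, the sketch uses `\r?\n` to consume a full CR/LF pair, uses `[^\r\n]+` for a non-empty text line, and also accepts a last line without a trailing line break. The names `CUE_RE` and `parse_srt` are made up for this example.

```python
import re

CUE_RE = re.compile(
    r"(\d+)\r?\n"                                   # cue number
    r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\r?\n"   # timing line
    r"((?:[^\r\n]+(?:\r?\n|$))+)"                   # one or more non-empty text lines
)

def parse_srt(content):
    """Return a list of (number, timing, [text lines]) tuples."""
    return [
        (number, timing, text.splitlines())
        for number, timing, text in CUE_RE.findall(content)
    ]

# The sample text from this answer, without a line break after the last line
sample = (
    "1\n00:02:17,440 --> 00:02:20,375\n"
    "Some multi lined text\nThis is a second line\n"
    "\n"
    "2\n00:02:20,476 --> 00:02:22,501\n"
    "as well as a single line\n"
    "\n"
    "3\n00:03:20,476 --> 00:03:22,501\n"
    "should be able to parse unicoded text too\nこんにちは"
)

for number, timing, lines in parse_srt(sample):
    print(number, timing, lines)
```

When the file is read with `open()` in text mode, Python translates `\r\n` to `\n` by default, so the `\r?` parts only matter if the content reaches the regex with its original Windows line endings.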

score 0 · Accepted Answer

Time:

pattern = ("(\d{2}:\d{2}:\d{2},\d{3}?.*)")

answered 2015-12-04T10:23:30.257
