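Interesting question. Here's my take on it.
In essence, the subtitles don't "know" about each other, which means it's best to include the previous and the next subtitle text in each doc (n - 1, n, n + 1) whenever applicable.
As such, you'd be looking for a doc structure akin to: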
{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}
To arrive at such a doc structure I used the following (inspired by this excellent answer):
from itertools import groupby
from collections import namedtuple

def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]
    Subtitle = namedtuple('Subtitle', 'sub_id start end text')
    subs = []
    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')
            # ints only
            sub_id = int(sub_id)
            # join multi-line text
            text = ', '.join(content)
            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))
    es_ready_subs = []
    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''
        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '
        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text
        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))
    return es_ready_subs
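To make the neighbour window concrete, here is a minimal sketch of the same n - 1, n, n + 1 joining applied to a plain list of strings. The overlapping_texts helper is a hypothetical name used only for illustration; it mirrors how overlapping_text is assembled above but skips the file parsing.

```python
def overlapping_texts(texts):
    """Join each text with its previous and next neighbour, where they exist."""
    windows = []
    for i in range(len(texts)):
        # the slice clamps itself at both ends of the list,
        # so the first and last items only get one neighbour
        neighbours = texts[max(i - 1, 0): i + 2]
        windows.append(' '.join(neighbours))
    return windows


sample = [
    "Senator, we're making our final",
    "approach into Coruscant.",
    "Very good, Lieutenant.",
]

for window in overlapping_texts(sample):
    print(window)
```

The first window contains only the first two lines, which is why a phrase such as "final approach" that straddles two subtitles can still be matched against a single document.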
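Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable: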
PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
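As a rough illustration of what the as_timestamp sub-field buys you: judging by the sort values in the response further down (137440 for 00:02:17,440), a time-only value in this format ends up sorting as milliseconds since midnight. The sketch below reproduces that arithmetic in plain Python; srt_to_millis is a hypothetical helper and not something Elasticsearch exposes.

```python
def srt_to_millis(timestamp):
    """Convert an SRT timestamp such as '00:02:17,440' to milliseconds."""
    clock, millis = timestamp.split(',')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


print(srt_to_millis("00:02:17,440"))
print(srt_to_millis("00:02:20,476"))
```

Sorting on the raw text field would compare strings instead, which is why the date sub-field is the one to sort on.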
Once the mapping is in place, proceed to ingest:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from utils.parse import parse_subs

es = Elasticsearch()
es_ready_subs = parse_subs('subs.txt')
actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]
bulk(es, actions)
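For longer subtitle files it may be preferable not to build the whole actions list up front. Since the bulk helper accepts any iterable of actions, a generator is one option. This is only a sketch: generate_actions and its index_name parameter are hypothetical names, and the bulk call itself is left as a comment because it needs a running cluster.

```python
def generate_actions(subs, index_name="my_subtitles_index"):
    """Lazily yield one bulk action per parsed subtitle dict."""
    for sub in subs:
        yield {
            "_index": index_name,
            "_id": sub["sub_id"],  # reuse the subtitle number as the doc id
            "_source": sub,
        }


example_subs = [
    {"sub_id": 0, "text": "Senator, we're making our final"},
    {"sub_id": 1, "text": "approach into Coruscant."},
]

# with a live client this would be: bulk(es, generate_actions(es_ready_subs))
for action in generate_actions(example_subs):
    print(action["_id"], action["_source"]["text"])
```

Reusing sub_id as the document id also means that re-running the ingest overwrites existing docs rather than duplicating them.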
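Once ingested, you can target the original subtitle text field and boost it if it directly matches your phrase. Otherwise, add a fallback on the overlapping_text field, which ensures that both "overlapping" subtitles are returned.
Before returning, you can make sure that the hits are ordered by start, ascending. Sorting kind of defeats the purpose of boosting, but if you do sort, you can specify track_scores in the URI to make sure the originally calculated scores are returned too.
Putting it all together: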
POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
yields:
{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
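If the search is issued from Python rather than the console, the request body above can be assembled for an arbitrary phrase. A minimal sketch follows; build_phrase_query is a hypothetical helper, and how the resulting dict is handed to the client (along with track_scores) depends on the elasticsearch-py version in use, so that call is left out.

```python
import json


def build_phrase_query(phrase, boost=2):
    """Build the boosted text / fallback overlapping_text search body."""
    return {
        "query": {
            "bool": {
                "should": [
                    # direct hit on the subtitle's own text scores higher
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # fallback: the phrase straddles neighbouring subtitles
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        # chronological order, using the date sub-field from the mapping
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
    }


print(json.dumps(build_phrase_query("final approach"), indent=2))
```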