Given the following data (a simple srt file):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

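What is the best way to index this in Elasticsearch? There is a catch: I want the highlighted search results to link to the exact time indicated by the timestamp. In addition, some phrases span multiple srt lines ("final approach" in the example above).

My ideas so far:

  • Index the srt file as a list type, with the timestamps as the index. I believe this would not match phrases that span multiple keys.
  • Create a custom tokenizer that indexes only the text part. I'm not sure how far Elasticsearch can go in highlighting the original content.
  • Index only the text part and map it back to the timestamps outside of Elasticsearch.

Or is there another option that solves this elegantly?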


1 Answer

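Interesting question. Here's my take.

In essence, subtitles don't "know" about each other, which means it is best to include the previous and the next subtitle text in each document (n - 1, n, n + 1), wherever applicable.

You would therefore need a document structure similar to the following: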

{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}
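
To make the reasoning concrete, here is a minimal sketch showing that the phrase "final approach" occurs in neither subtitle's own text, but does occur once each text is joined with its neighbours. The helper name with_neighbours is hypothetical and only serves this illustration; a plain substring check stands in for the phrase match.

```python
def with_neighbours(texts):
    """Join each text with its previous and next neighbour (n - 1, n, n + 1)."""
    windows = []
    for i in range(len(texts)):
        # Slice bounds are clamped, so the first and last items
        # simply end up with one neighbour instead of two.
        window = texts[max(i - 1, 0):i + 2]
        windows.append(' '.join(window))
    return windows


texts = ["Senator, we're making our final", "approach into Coruscant."]
phrase = "final approach"

print([phrase in t for t in texts])                   # [False, False]
print([phrase in w for w in with_neighbours(texts)])  # [True, True]
```

Both documents match through the joined text, which is why both "overlapping" subtitles can be returned for a phrase that straddles them.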

To arrive at such a document structure, I used the following (inspired by this excellent answer):

from itertools import groupby
from collections import namedtuple


def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')

    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []

    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '

        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs

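Once the subtitles are parsed, they can be ingested into ES. Before doing that, set up the following mapping so that your timestamps are properly searchable and sortable: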

PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}

Once that's done, proceed with the ingestion:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)

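Once ingested, you can target the original subtitle text field and boost it when it directly matches your phrase. Otherwise, add a fallback on the overlapping_text field, which ensures that both "overlapping" subtitles are returned.

Before returning, you can make sure the hits are ordered by start, ascending. Sorting somewhat defeats the purpose of boosting, but if you do sort, you can specify the track_scores parameter in the URI to ensure that the originally calculated scores are returned too.

Putting it all together: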

POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
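
If you issue the search from Python rather than from the console, the same request body can be built as a plain dictionary. This is only a sketch: the function name build_phrase_query and its boost parameter are my own, and how the body is handed to the client depends on your elasticsearch-py version. Here track_scores is set in the body instead of the URI.

```python
import json


def build_phrase_query(phrase, boost=2):
    """Build the bool/should body: boosted match on text, fallback on overlapping_text."""
    return {
        "track_scores": True,  # keep _score even though hits are sorted by start time
        "query": {
            "bool": {
                "should": [
                    # Direct hit inside a single subtitle, boosted
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # Fallback: phrase spans neighbouring subtitles
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
    }


body = build_phrase_query("final approach")
print(json.dumps(body, indent=2))
```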

which yields:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
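
The sort values in this response (137440 and 140476) correspond to the start timestamps expressed in milliseconds. If you need the same number on the client side, for example to link a hit to the exact playback position, a small conversion like the one below should do. The function name srt_to_millis is hypothetical, and the sketch assumes well-formed HH:mm:ss,SSS timestamps.

```python
def srt_to_millis(timestamp):
    """Convert an srt timestamp such as '00:02:17,440' to milliseconds."""
    clock, millis = timestamp.split(',')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


print(srt_to_millis("00:02:17,440"))         # 137440
print(srt_to_millis("00:02:20,476"))         # 140476
print(srt_to_millis("00:02:17,440") / 1000)  # 137.44 (seconds, e.g. a player seek offset)
```
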
answered 2021-04-16T11:41:01.110