python - How do I get the index in the original text of entities found by the Polyglot Python library?

Question

I want to get the index, in the original text, of the entities found with Python's polyglot library.

    # Polyglot example NER
    from polyglot.text import Text
    text1 = u'Ik wil Ben mijn zoontje met de naam Ben ziek melden.'
    print(text1)
    ptext1 = Text(text1)
    print(ptext1.entities)
    for sent in ptext1.sentences:
        for entity in sent.entities:
            print(entity.tag, entity, entity.start, entity.end)

The result is: [I-PER(['Ben'])] I-PER ['Ben'] 8 9

So the question is: given that these are chunk indices, how do I get the start and end character indices in the original sentence?

score 0 · Accepted Answer

In case someone needs a better version someday:

from typing import Tuple
from polyglot.text import Text, Sentence, Chunk

doc = "         Apple is looking at buying Samsung for $1 billion and Donald Trump isnt happy.               Second sentence with this time Joe Biden."
text = Text(doc, hint_language_code="en")

def get_position_in_text(sentence: Text, entity: Chunk) -> Tuple[int, int]:
    """Return the (start, end) character offsets of entity in sentence.raw, or (-1, -1) if not found."""
    sent = sentence.raw
    start_search = len("".join(sentence.words[0:entity.start]))
    try:
        start_pos = sent.index(entity[0], start_search)
        # It's a single word; that case is easier
        if len(entity) == 1:
            return start_pos, start_pos + len(entity[0])
        else:
            start_search = start_pos + len("".join(sentence.words[entity.start:entity.end - 1]))
            end_pos = sent.index(entity[-1], start_search)
            return start_pos, end_pos + len(entity[-1])
    except ValueError:
        return -1, -1

print(text.raw + "\n")
for entity in text.entities:
    # Polyglot does not give you the character position,
    # but it can be recovered by searching the raw text.
    start_pos, end_pos = get_position_in_text(text, entity)
    print(entity.tag, entity, "start", start_pos, "end", end_pos)

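This is a better version because the other answer works sentence by sentence, and each sentence has its leading and trailing whitespace stripped, so the offsets easily come out wrong.

This one uses text.raw instead, which keeps the text intact, whitespace included.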
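The same idea can also be sketched as a single left-to-right alignment pass: locate every token in the raw text once, then translate an entity's word indices into character offsets. This is only a sketch under assumptions: align_tokens and chunk_char_span are hypothetical helper names, the token list is hand-written as a stand-in for text.words, and every token is assumed to appear verbatim and in order in the raw text.

```python
from typing import List, Sequence, Tuple


def align_tokens(raw: str, tokens: Sequence[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets in raw for each token, in order."""
    spans = []
    cursor = 0
    for token in tokens:
        # Search only from the end of the previous token onward.
        # str.index raises ValueError if the token cannot be found.
        start = raw.index(token, cursor)
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def chunk_char_span(spans: Sequence[Tuple[int, int]],
                    start_word: int, end_word: int) -> Tuple[int, int]:
    """Map the word range [start_word, end_word) to a character range."""
    return spans[start_word][0], spans[end_word - 1][1]


raw = "   Apple is looking at buying Samsung.   Joe Biden is here."
# Hand-written tokens standing in for text.words (assumption).
tokens = ["Apple", "is", "looking", "at", "buying", "Samsung", ".",
          "Joe", "Biden", "is", "here", "."]

spans = align_tokens(raw, tokens)
# Word indices 7..9 play the role of entity.start and entity.end.
start, end = chunk_char_span(spans, 7, 9)
print(raw[start:end])
```

With polyglot, the inputs would presumably be text.raw, text.words, entity.start and entity.end. Multi-word entities need no special case here, because the span runs from the first word's start to the last word's end.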

score 0 · Accepted Answer

I just found a solution to my problem (maybe not the best one, but it works for now):

ptext1 = Text(text1)
prevIndex = 0
for sent in ptext1.sentences:
    for entity in sent.entities:
        print(entity.tag, entity, entity.start, entity.end)
        currentIndex = ptext1.index(entity[0], prevIndex)
        print('startindex={}, endindex={}'.format(currentIndex, currentIndex+len(entity[0])))
        prevIndex = currentIndex+len(entity[0])

This gives the start and end index of the entity in the original string.
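One caveat is worth checking on the sentence from the question, where "Ben" occurs twice and the reported word indices (8 and 9) point at the second occurrence. A plain forward search from position 0 stops at the first "Ben". The sketch below uses only str.index, with a hand-written token list standing in for ptext1.words (an assumption), and bounds the search by the combined length of the preceding words.

```python
text1 = "Ik wil Ben mijn zoontje met de naam Ben ziek melden."
# Hand-written tokens standing in for ptext1.words (assumption).
words = ["Ik", "wil", "Ben", "mijn", "zoontje", "met", "de", "naam",
         "Ben", "ziek", "melden", "."]
entity_start, entity_end = 8, 9  # word indices reported in the question

# Searching from the beginning finds the first "Ben", not the tagged one.
naive = text1.index(words[entity_start], 0)

# The preceding words occupy at least this many characters (whitespace
# is not counted), so the entity cannot start before this position.
lower_bound = len("".join(words[:entity_start]))
bounded = text1.index(words[entity_start], lower_bound)

print(naive, bounded)  # 7 36
```

The lower bound is conservative because it ignores whitespace, so an earlier identical word that falls after the bound could still be matched. A full token-by-token alignment avoids that.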
