python - spaCy nlp - Tagging entities in a string

Question

Say I have a string and want to tag some entities in it, such as Persons and Locations.

string = 'My name is John Doe, and I live in USA'
string_tagged = 'My name is [John Doe], and I live in {USA}'

I want to tag persons with [ ] and locations with { }.

My code:

import spacy    
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        sentence = sentence[:ent.start_char] + sentence[ent.start_char:].replace(ent.text, '[' + ent.text + ']', 1)
    elif ent.label_ == 'GPE':
        sentence = sentence[:ent.start_char] + sentence[ent.start_char:].replace(ent.text, '{' + ent.text + '}', 1)

    print(sentence[:ent.start_char] + sentence[ent.start_char:])

...so this works fine with the example string. But with more complex sentences, I get doubled brackets around some entities. For the sentence:

string_bug = 'Canada, Canada, Canada, Canada, Canada, Canada'

returns >> {Canada}, {Canada}, {Canada}, {Canada}, {{Canada}}, Canada

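The reason I split the sentence string in two is to replace only new words (those at a higher character position). I think the bug may be that I am looping over doc.ents, so I get the old positions from the original string, while my string grows with each loop as new [] and {} are added. But it feels like there must be a simpler way to handle this in spaCy.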

score 0 · Accepted Answer

Here is a slight modification of your code that worked for me.

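string = 'My name is John Doe, and I live in USA'

import re
import spacy
# 'en' is the spaCy 2.x shortcut; spaCy 3.x needs the full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
    # re.escape keeps entity text such as 'U.S.' from being read as a regex
    if ent.label_ == 'PERSON':
        sentence = re.sub(re.escape(ent.text), '[' + ent.text + ']', sentence)
    elif ent.label_ == 'GPE':
        sentence = re.sub(re.escape(ent.text), '{' + ent.text + '}', sentence)
print(sentence)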

Output:

My name is [John Doe], and I live in {USA}
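
Note that re.sub replaces every occurrence on each pass, so a sentence that repeats the same entity (like the Canada example in the question) would still end up with nested brackets. Below is a minimal sketch of one way around that, not a built-in spaCy feature: it uses the character offsets directly and works from the end of the string backwards, so inserting brackets never shifts the offsets of entities that are still waiting to be processed. The names tag_spans and BRACKETS are hypothetical, made up for this illustration, and the spans are hard-coded here so the snippet runs without a model. With spaCy they would presumably be built as [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents].

```python
# Hypothetical helper for illustration; not part of spaCy.
BRACKETS = {'PERSON': ('[', ']'), 'GPE': ('{', '}')}

def tag_spans(text, spans):
    """Wrap each (start_char, end_char, label) span in its brackets."""
    # Process spans from last to first: text before a span is untouched
    # by later insertions, so the original offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        if label in BRACKETS:
            left, right = BRACKETS[label]
            text = text[:start] + left + text[start:end] + right + text[end:]
    return text

# Assumed spans, standing in for what doc.ents would provide.
text = 'Canada, Canada, Canada'
spans = [(0, 6, 'GPE'), (8, 14, 'GPE'), (16, 22, 'GPE')]
print(tag_spans(text, spans))  # {Canada}, {Canada}, {Canada}
```

Labels without an entry in BRACKETS are simply skipped, which mirrors the if/elif in the code above.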
