nlp - Retrieving the start and end character indices in the original document for sentences returned by spaCy

Question

I'm using something like the following pattern to retrieve the start and end indices of spaCy sentences in the original document:

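import re
import spacy

# spaCy 1.x style pipeline construction; fulltext holds the raw document string
nlp = spacy.en.English()
doc = nlp(fulltext)

tot = 0
prev_end = 0
for sent in doc.sents:
    # Locate the sentence text in the original document
    x = re.search(re.escape(sent.text), fulltext)
    print(x.start(), x.end(), ">>>", sent.text)
    # Accumulate the distance covered since the previous sentence end
    tot += x.end() - prev_end
    prev_end = x.end()

if len(fulltext) == tot:
    print("works")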

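This seems to work for the few test documents I've tried. But I'm worried that I'm overlooking a gotcha, such as spaCy sometimes stripping characters without my knowing. Am I?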

PS: In case it helps, I need to compare these indices against the ones in Brat annotation files.

score 7 · Accepted Answer

You should be able to just use the sent.start_char and sent.end_char attributes. These give the indices you're after: https://spacy.io/docs/api/span#attributes

Also, doc.text should always equal the original full text. If it doesn't, please file a bug report.
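As a rough illustration of both points, here is a minimal sketch. It assumes spaCy v3 or later and uses a blank English pipeline with the rule-based sentencizer, so no trained model is required; the sample text and the variable name spans are made up for the example. The sample deliberately repeats a sentence, which is a case where searching for the sentence text with re.search would return the first occurrence both times.

```python
import spacy

# Assumption: spaCy v3+. A blank pipeline plus the rule-based sentencizer
# is enough to get doc.sents without downloading a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical sample text; note the repeated sentence and irregular whitespace.
fulltext = "Hello world.  This is a test.\nHello world."
doc = nlp(fulltext)

# The doc should reproduce the input exactly, whitespace included.
assert doc.text == fulltext

spans = []
for sent in doc.sents:
    # start_char / end_char are character offsets into the original text
    spans.append((sent.start_char, sent.end_char))
    # Slicing the original text with these offsets yields the sentence text
    assert fulltext[sent.start_char:sent.end_char] == sent.text
    print(sent.start_char, sent.end_char, ">>>", repr(sent.text))
```

Because the offsets come from the tokens themselves rather than from a text search, repeated sentences get distinct positions.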
