spacy - Multi-word expression recognition in spaCy

Question

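I have a text together with index entries, some of which indicate important multi-word expressions (MWEs) that occur in the text (e.g. "spongy bone" for a biology text). I want to use these entries to build a custom matcher in spaCy so that I can identify occurrences of the MWEs in the text. An additional requirement is that the matched occurrences must preserve the lemmatized representation and POS tags of the MWE's constituent words.

I have looked at existing spaCy examples that do something similar, but I can't seem to get the hang of the pattern.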

score -1 · Accepted Answer

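The spaCy documentation is not very clear on using the Matcher class with multiple phrases, but there is a multi-phrase matching example in the GitHub repository.

I recently faced the same challenge and got it working as follows. My text file contains one record per line, consisting of a phrase and its description separated by '::'.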

import spacy
from spacy.matcher import PhraseMatcher

# Assumes spaCy v3 with the en_core_web_sm model installed;
# older releases used a different PhraseMatcher constructor and match tuple.
nlp = spacy.load('en_core_web_sm')
text = nlp('Your text here')

# Create a list of (phrase, description) tuples from the file, skipping blank lines
with open('textfile', 'r', encoding='utf8') as doc:
    rules = [tuple(line.rstrip('\n').split('::')) for line in doc if line.strip()]

# Create a dictionary mapping each lowercased phrase to its description
rules_dict = {item[0].lower(): item[-1] for item in rules}

# Register every phrase under its own key so a match can be traced back to it;
# attr='LOWER' makes the comparison case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
for phrase in rules_dict:
    # make_doc only tokenizes, which is all a phrase pattern needs here
    matcher.add(phrase, [nlp.make_doc(phrase)])

# Match using the PhraseMatcher class
matches = matcher(text)
result = list()

for match_id, start, end in matches:
    # match_id is a hash; look the phrase string back up in the vocab
    label = nlp.vocab.strings[match_id]
    result.append({"start": start, "end": end, "phrase": label, "desc": rules_dict[label]})
print(result)
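
The question also asks for the lemmas and POS tags of the words inside each match. Since the matcher reports token offsets, slicing the processed document with them gives back the original tokens, and spaCy tokens carry `lemma_` and `pos_` attributes. Below is a minimal sketch of that idea, not a definitive implementation. The helper name `describe_matches` is made up for illustration, and the example sentence is annotated by hand with a stand-in `Token` type so that it runs without downloading a model.

```python
from collections import namedtuple

# Stand-in for a spaCy Token so the sketch runs without a model download.
# Real spaCy tokens expose the same `text`, `lemma_` and `pos_` attributes.
Token = namedtuple("Token", ["text", "lemma_", "pos_"])


def describe_matches(doc, spans):
    """Return text, lemma and POS tag for every token inside each matched span.

    `doc` is anything that can be sliced into tokens (a spaCy Doc can);
    `spans` is an iterable of (start, end) token offsets from the matcher.
    """
    described = []
    for start, end in spans:
        tokens = doc[start:end]
        described.append({
            "start": start,
            "end": end,
            "tokens": [(t.text, t.lemma_, t.pos_) for t in tokens],
        })
    return described


# Hypothetical, hand-annotated sentence standing in for the output of nlp(...).
doc = [
    Token("The", "the", "DET"),
    Token("femur", "femur", "NOUN"),
    Token("contains", "contain", "VERB"),
    Token("spongy", "spongy", "ADJ"),
    Token("bone", "bone", "NOUN"),
]
mwe_info = describe_matches(doc, [(3, 5)])
print(mwe_info)
```

With a real pipeline, the same helper can be called with the `text` document from the code above and the start and end offsets taken from each entry in `result`. The lemmas and tags are only filled in if the loaded pipeline includes a tagger and lemmatizer.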
