nlp - Chunking with a rule-based grammar in spaCy

Question

I have a simple chunking example in nltk.

My data:

data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'

...preprocessing...

data_tok = nltk.word_tokenize(data) #tokenisation
data_pos = nltk.pos_tag(data_tok) #POS tagging

Chunking:

cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)

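This returns (among other things) (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP), so it does what I want.

Now my question: I want to switch to spaCy for my project. How would I do this in spaCy?

I get as far as tagging it (the coarser .pos_ attribute is enough for me):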

from spacy.en import English    
parser = English()
parsed_sent = parser(u'The little yellow dog will then walk to the Starbucks, where')

def print_coarse_pos(token):
  print(token, token.pos_)

for sentence in parsed_sent.sents:
  for token in sentence:
    print_coarse_pos(token)

...which returns the tokens and their tags: The DET little ADJ yellow ADJ dog NOUN will VERB then ADV walk VERB ...

How can I extract chunks with my own grammar?

score 4 · Accepted Answer

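Copied verbatim from https://github.com/spacy-io/spaCy/issues/342

There are a few ways to go about this. The closest functionality to the RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator: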

doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
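For the Matcher route mentioned above, here is a minimal sketch of how the nltk pattern `<VB><.*>*?<NNP>` might be expressed. It assumes the spaCy v3 API (`spacy.blank`, `Matcher.add(name, [pattern])`), which is newer than the `spacy.en` import used in the question. The tags are supplied by hand so that the example runs without a trained model; in a real pipeline they would come from the tagger. Note that the Matcher has no lazy quantifier like `*?`; it returns every span that fits the pattern.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank English pipeline: vocab and tokenizer only, no trained model needed.
nlp = spacy.blank("en")

# Assumption: fine-grained tags are set by hand for illustration;
# normally a trained tagger would assign them.
words = ["The", "little", "yellow", "dog", "will", "then",
         "walk", "to", "the", "Starbucks"]
tags = ["DT", "JJ", "JJ", "NN", "MD", "RB", "VB", "TO", "DT", "NNP"]
doc = Doc(nlp.vocab, words=words, tags=tags)

matcher = Matcher(nlp.vocab)
# One VB token, any number of arbitrary tokens, then one NNP token.
pattern = [{"TAG": "VB"}, {"OP": "*"}, {"TAG": "NNP"}]
matcher.add("CUSTOMCHUNK", [pattern])

# Each match is a (match_id, start, end) triple of token offsets.
chunks = [doc[start:end].text for _, start, end in matcher(doc)]
print(chunks)
```

On longer sentences with several NNP tokens after a verb, the pattern can produce overlapping matches, so some filtering (for example keeping the shortest span per start position) may be needed to mimic the lazy nltk behaviour.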

The basic way that doc.noun_chunks works is something like this:

for token in doc:
    if is_head_of_chunk(token):
        chunk_start = token.left_edge.i
        chunk_end = token.right_edge.i + 1
        yield doc[chunk_start : chunk_end]

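You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out which labels to use: http://spacy.io/demos/displacy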
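As a rough illustration of the dependency-based approach, the sketch below fills in one possible `is_head_of_chunk`. It assumes the spaCy v3 `Doc` constructor, and the heads and dependency labels are written by hand (a hypothetical parse, not output from a trained model) so that it runs without downloading anything. The choice of the `prep` label and the helper name `iter_chunks` are purely illustrative.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: a hand-written parse for illustration. `heads` holds the
# absolute index of each token's head; the root points at itself.
words = ["The", "dog", "will", "walk", "to", "the", "Starbucks"]
heads = [1, 3, 3, 3, 3, 6, 4]
deps = ["det", "nsubj", "aux", "ROOT", "prep", "det", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)


def is_head_of_chunk(token):
    # Illustrative rule: treat every preposition as the head of a chunk.
    return token.dep_ == "prep"


def iter_chunks(doc):
    # Yield the full subtree span of each token selected as a chunk head.
    for token in doc:
        if is_head_of_chunk(token):
            chunk_start = token.left_edge.i
            chunk_end = token.right_edge.i + 1
            yield doc[chunk_start:chunk_end]


chunk_texts = [span.text for span in iter_chunks(doc)]
print(chunk_texts)
```

Swapping the rule inside `is_head_of_chunk` (for example to test `token.pos_` or a different dependency label) changes which subtrees are extracted without touching the span logic.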
