I have a simple chunking example in NLTK.
My data:
data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'
...preprocessing...
data_tok = nltk.word_tokenize(data) #tokenisation
data_pos = nltk.pos_tag(data_tok) #POS tagging
Chunking:
cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)
This returns (among other things): (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP)
So it does what I want.
Now my question: I want to switch to spaCy for my project. How would I do this in spaCy?
I got as far as tagging it (the coarser .pos_ method is fine for me):
from spacy.en import English
parser = English()
parsed_sent = parser(u'The little yellow dog will then walk to the Starbucks, where')
def print_coarse_pos(token):
    print(token, token.pos_)

for sentence in parsed_sent.sents:
    for token in sentence:
        print_coarse_pos(token)
...which returns the tags and tokens:
The DET
little ADJ
yellow ADJ
dog NOUN
will VERB
then ADV
walk VERB
...
How can I extract chunks with my own grammar?
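In case it helps frame the question: one way I could imagine approximating NLTK's tag-regex chunking is to run an ordinary regex over the POS-tag sequence myself. The sketch below is hypothetical (the `chunk` helper and its pattern syntax are my own, not a spaCy API); the `tagged` list hardcodes the coarse tags printed above so it runs without spaCy installed, but it would work the same on `[(t.text, t.pos_) for t in doc]`. Note that with coarse tags both "will" and "walk" are VERB, so matches start at "will", unlike NLTK's finer MD/VB split.

```python
import re

# Tagged sentence, hardcoded from the spaCy output shown above.
tagged = [
    ("The", "DET"), ("little", "ADJ"), ("yellow", "ADJ"), ("dog", "NOUN"),
    ("will", "VERB"), ("then", "ADV"), ("walk", "VERB"), ("to", "ADP"),
    ("the", "DET"), ("Starbucks", "PROPN"), (",", "PUNCT"),
    ("where", "ADV"), ("he", "PRON"), ("will", "VERB"),
    ("introduce", "VERB"), ("them", "PRON"), ("to", "ADP"),
    ("Michael", "PROPN"), (".", "PUNCT"),
]

def chunk(tagged, pattern):
    """Return token spans whose tag sequence matches the regex pattern.

    Each tag is wrapped in <...>, so a pattern like
    '<VERB>(<[^>]*>)*?<PROPN>' mirrors NLTK's '<VB><.*>*?<NNP>'.
    """
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    chunks = []
    for m in re.finditer(pattern, tag_string):
        # Map character offsets back to token indices by counting '<'.
        start = tag_string.count("<", 0, m.start())
        end = tag_string.count("<", 0, m.end())
        chunks.append(" ".join(tok for tok, _ in tagged[start:end]))
    return chunks

print(chunk(tagged, r"<VERB>(<[^>]*>)*?<PROPN>"))
# prints ['will then walk to the Starbucks', 'will introduce them to Michael']
```

Using `token.tag_` (fine-grained tags such as VB/MD) instead of `token.pos_` should give results closer to the NLTK version, but I'm hoping there is a more idiomatic spaCy way.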