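You need to be more specific about what you actually want to extract, but here is an attempt.
It looks like you are trying to extract verb phrases together with their adjectives/adverbs. If so, you could try: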
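
from nltk import pos_tag, word_tokenize
from nltk import ngrams

text = "this is not good."
tagged_text = pos_tag(word_tokenize(text))

# Adjective, adverb and verb tags (Penn Treebank). 'VBZ' is added because
# "is" is normally tagged VBZ and would otherwise never match.
focus_tags = {'JJ', 'JJS', 'RB', 'RBR', 'RBS', 'VB', 'VBN', 'VBP', 'VBZ'}

# Print every bigram in which both tokens carry one of the focus tags.
for (token1, tag1), (token2, tag2) in ngrams(tagged_text, 2):
    if tag1 in focus_tags and tag2 in focus_tags:
        print(token1 + ' ' + token2)
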
But this prints both "is not" and "not good"!
Hmm, in that case, do you want precisely "not good" or "is not good"?
If it is the trigram "is not good", try:
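
# Reuses tagged_text and focus_tags from the snippet above.
# Print every trigram in which all three tokens carry a focus tag.
for (token1, tag1), (token2, tag2), (token3, tag3) in ngrams(tagged_text, 3):
    if tag1 in focus_tags and tag2 in focus_tags and tag3 in focus_tags:
        print(token1 + ' ' + token2 + ' ' + token3)
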
What if I only want "not good"?
Maybe try removing the verb tags? For example:

from nltk import pos_tag, word_tokenize
from nltk import ngrams

text = "this is not good."
tagged_text = pos_tag(word_tokenize(text))

# Adjective and adverb tags only; verbs are left out on purpose.
focus_tags = {'JJ', 'JJS', 'RB', 'RBR', 'RBS'}

# Print every bigram in which both tokens are adjectives or adverbs.
for (token1, tag1), (token2, tag2) in ngrams(tagged_text, 2):
    if tag1 in focus_tags and tag2 in focus_tags:
        print(token1 + ' ' + token2)
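
If you do not want to decide between bigrams and trigrams up front, the same idea can be wrapped in a couple of small helpers. The following is only a rough sketch: `matching_ngrams` and `longest_runs` are hypothetical names, not part of NLTK, and the tagged sentence is written out by hand (assuming "is" is tagged VBZ) so the snippet runs without downloading a tagger model. In practice you would pass in the result of `pos_tag(word_tokenize(text))` instead.

```python
from itertools import groupby

# Hand-written (token, tag) pairs standing in for pos_tag(word_tokenize(text)).
# The tags are an assumption about what the tagger returns for this sentence.
tagged_text = [("this", "DT"), ("is", "VBZ"), ("not", "RB"), ("good", "JJ"), (".", ".")]


def matching_ngrams(tagged, focus_tags, n):
    """Return every n-gram whose tokens all carry a tag in focus_tags."""
    windows = zip(*(tagged[i:] for i in range(n)))
    return [" ".join(tok for tok, _ in window)
            for window in windows
            if all(tag in focus_tags for _, tag in window)]


def longest_runs(tagged, focus_tags, min_len=2):
    """Return maximal runs of consecutive tokens whose tags are in focus_tags."""
    runs = []
    for keep, group in groupby(tagged, key=lambda pair: pair[1] in focus_tags):
        tokens = [tok for tok, _ in group]
        if keep and len(tokens) >= min_len:
            runs.append(" ".join(tokens))
    return runs


modifiers = {"JJ", "JJS", "RB", "RBR", "RBS"}
with_verbs = modifiers | {"VB", "VBN", "VBP", "VBZ"}

print(matching_ngrams(tagged_text, with_verbs, 2))  # bigrams, verbs allowed
print(matching_ngrams(tagged_text, with_verbs, 3))  # trigrams, verbs allowed
print(longest_runs(tagged_text, modifiers))         # longest phrase without verbs
print(longest_runs(tagged_text, with_verbs))        # longest phrase with verbs
```

`longest_runs` sidesteps the "not good" versus "is not good" question by returning the longest unbroken stretch of matching tokens, so the choice of tag set alone decides whether the verb is included.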