python - Confused about the order of the stemmer and the POS tagger

Question

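I'm analyzing a text corpus and have run a stemmer over all the tokenized words. I also need to find all the nouns in the corpus, so I then ran nltk.pos_tag(stemmed_sentence). My question is: am I doing this right?

A.] tokenize->stem->pos_tagging

or

B.] tokenize->stem       #stemming and pos_tagging done separately
    tokenize->pos_tagging

I followed approach A, but I'm not sure whether it is the right way to do POS tagging.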

score 7 · Accepted Answer

Why don't you try it and see?

Here is an example:

>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize, pos_tag
>>> sent = "This is a messed up sentence from the president's Orama and it's going to be sooo good, you're gonna laugh."

Here is the result of tokenization:

>>> [word for word in word_tokenize(sent)]
['This', 'is', 'a', 'messed', 'up', 'sentence', 'from', 'the', 'president', "'s", 'Orama', 'and', 'it', "'s", 'going', 'to', 'be', 'sooo', 'good', ',', 'you', "'re", 'gon', 'na', 'laugh', '.']

Here is the result of tokenize -> stem:

>>> porter = PorterStemmer()
>>> [porter.stem(word) for word in word_tokenize(sent)]
[u'Thi', u'is', u'a', u'mess', u'up', u'sentenc', u'from', u'the', u'presid', u"'s", u'Orama', u'and', u'it', u"'s", u'go', u'to', u'be', u'sooo', u'good', u',', u'you', u"'re", u'gon', u'na', u'laugh', u'.']

Here is the result of tokenize -> stem -> POS tag:

>>> pos_tag([porter.stem(word) for word in word_tokenize(sent)])
[(u'Thi', 'NNP'), (u'is', 'VBZ'), (u'a', 'DT'), (u'mess', 'NN'), (u'up', 'RP'), (u'sentenc', 'NN'), (u'from', 'IN'), (u'the', 'DT'), (u'presid', 'JJ'), (u"'s", 'POS'), (u'Orama', 'NNP'), (u'and', 'CC'), (u'it', 'PRP'), (u"'s", 'VBZ'), (u'go', 'RB'), (u'to', 'TO'), (u'be', 'VB'), (u'sooo', 'RB'), (u'good', 'JJ'), (u',', ','), (u'you', 'PRP'), (u"'re", 'VBP'), (u'gon', 'JJ'), (u'na', 'NN'), (u'laugh', 'IN'), (u'.', '.')]

Here is the result of tokenize -> POS tag:

>>> pos_tag([word for word in word_tokenize(sent)])
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('messed', 'VBN'), ('up', 'RP'), ('sentence', 'NN'), ('from', 'IN'), ('the', 'DT'), ('president', 'NN'), ("'s", 'POS'), ('Orama', 'NNP'), ('and', 'CC'), ('it', 'PRP'), ("'s", 'VBZ'), ('going', 'VBG'), ('to', 'TO'), ('be', 'VB'), ('sooo', 'RB'), ('good', 'JJ'), (',', ','), ('you', 'PRP'), ("'re", 'VBP'), ('gon', 'JJ'), ('na', 'NN'), ('laugh', 'IN'), ('.', '.')]
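
To make the two tagged outputs easier to compare, here is a minimal sketch that lists the positions where the tags disagree. The tag data is copied from the two runs above, truncated after "going"/"go" because the remaining tokens received identical tags in both runs. The helper name `tag_changes` is hypothetical, introduced only for illustration.

```python
# Tags copied from the two runs above, truncated after "going"/"go";
# the remaining tokens received identical tags in both runs.
raw_tagged = [
    ("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("messed", "VBN"),
    ("up", "RP"), ("sentence", "NN"), ("from", "IN"), ("the", "DT"),
    ("president", "NN"), ("'s", "POS"), ("Orama", "NNP"), ("and", "CC"),
    ("it", "PRP"), ("'s", "VBZ"), ("going", "VBG"),
]
stemmed_tagged = [
    ("Thi", "NNP"), ("is", "VBZ"), ("a", "DT"), ("mess", "NN"),
    ("up", "RP"), ("sentenc", "NN"), ("from", "IN"), ("the", "DT"),
    ("presid", "JJ"), ("'s", "POS"), ("Orama", "NNP"), ("and", "CC"),
    ("it", "PRP"), ("'s", "VBZ"), ("go", "RB"),
]


def tag_changes(raw, stemmed):
    """Return (raw_token, raw_tag, stemmed_token, stemmed_tag) where the tags differ."""
    return [
        (r_tok, r_tag, s_tok, s_tag)
        for (r_tok, r_tag), (s_tok, s_tag) in zip(raw, stemmed)
        if r_tag != s_tag
    ]


for change in tag_changes(raw_tagged, stemmed_tagged):
    print(change)
```

In this sample, four tokens change tag once they are stemmed: "This", "messed", "president" and "going". Two of those changes involve noun tags: "messed" goes from VBN to NN, and "president" goes from NN to JJ.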

So which is the right way?

score 1 · Accepted Answer

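I don't think you want to stem before POS tagging.

See this example here:

How to use part-of-speech tagging in NLTK

After importing NLTK in the Python interpreter, tokenize the text with word_tokenize before POS tagging, then call pos_tag: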

>>> import nltk
>>> text = nltk.word_tokenize("Dive into NLTK: Part-of-speech tagging and POS Tagger")
>>> text
['Dive', 'into', 'NLTK', ':', 'Part-of-speech', 'tagging', 'and', 'POS', 'Tagger']
>>> nltk.pos_tag(text)
[('Dive', 'JJ'), ('into', 'IN'), ('NLTK', 'NNP'), (':', ':'), ('Part-of-speech', 'JJ'), ('tagging', 'NN'), ('and', 'CC'), ('POS', 'NNP'), ('Tagger', 'NNP')]
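
Since the original goal was to find the nouns, one way to sketch the "tag first, stem afterwards" order is to filter the tagged tokens by their Penn Treebank noun tags (NN, NNS, NNP, NNPS) and stem only what is kept. This is an illustration under stated assumptions, not a definitive implementation: `extract_nouns` and its `stem` parameter are hypothetical names, and the tagged input is a hand-copied excerpt of the tokenize -> POS tag output from the other answer, so the block runs without NLTK installed.

```python
def extract_nouns(tagged_tokens, stem=None):
    """Return the noun tokens from (token, tag) pairs.

    The pairs must come from tagging the unstemmed tokens. If a `stem`
    callable is given, it is applied to the nouns after filtering.
    """
    # All Penn Treebank noun tags (NN, NNS, NNP, NNPS) start with "NN".
    nouns = [token for token, tag in tagged_tokens if tag.startswith("NN")]
    if stem is not None:
        nouns = [stem(token) for token in nouns]
    return nouns


# Excerpt of a tokenize -> POS tag result, copied by hand for illustration.
tagged = [
    ("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("messed", "VBN"),
    ("up", "RP"), ("sentence", "NN"), ("from", "IN"), ("the", "DT"),
    ("president", "NN"), ("'s", "POS"), ("Orama", "NNP"),
]

print(extract_nouns(tagged))  # ['sentence', 'president', 'Orama']
```

With NLTK available, the input would presumably come from `pos_tag(word_tokenize(sent))`, and `PorterStemmer().stem` could be passed as `stem` if stemmed nouns are needed.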
