python - Python NLTK collocations for tagged text

Question

I'm not sure whether this is possible, but I thought I'd ask just in case. For example, suppose you have a sample dataset of the form "body | tags":

"I went to the store and bought some bread" | shopping food

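I would like to know whether there is a way to use NLTK collocations to count how many times a body word and a tag word co-occur in the dataset. An example result might look like ("bread","food",598), where "bread" is the body word, "food" is the tag word, and 598 is the number of times they co-occur in the dataset.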

score 0 · Accepted Answer

Without using NLTK, you can do it like this:

from collections import Counter
from itertools import product

documents = '''"foo bar is not a sentence" | tag1
"bar bar black sheep is not a real sheep" | tag2
"what the bar foo is not a foo bar" | tag1'''

# Keep only the body text: drop the "| tag" part and the surrounding quotes.
documents = [line.split('|')[0].strip('" ') for line in documents.split('\n')]

collocations = Counter()

for doc in documents:
    tokens = doc.split()
    # Get all ordered token pairs with product.
    # NOTE: this also pairs each token occurrence with itself, so we
    #       subtract one (w, w) count per occurrence of w.
    pairs = Counter(product(tokens, tokens)) - Counter((w, w) for w in tokens)
    collocations += pairs

for pair, count in collocations.items():
    print(pair, count)
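
The code above counts word pairs within the body only. For the body-word/tag-word counts the question asks about, the same Counter and product idea can be applied across the two sides of the "|". The following is a minimal sketch, not part of the original answer: the function name body_tag_cooccurrences, the lowercasing, and the choice to count each distinct (word, tag) pair once per line are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def body_tag_cooccurrences(lines):
    """Count (body word, tag word) pairs over 'body | tags' lines.

    Hypothetical helper: each distinct pair counts once per line.
    """
    counts = Counter()
    for line in lines:
        if '|' not in line:
            continue  # skip lines that do not follow the 'body | tags' form
        # Split on the last '|' so a pipe inside the body does not break parsing.
        body, _, tags = line.rpartition('|')
        body_words = set(body.strip('" ').lower().split())
        tag_words = set(tags.split())
        counts.update(product(body_words, tag_words))
    return counts


sample = [
    '"I went to the store and bought some bread" | shopping food',
    '"fresh bread and butter" | food',
]
counts = body_tag_cooccurrences(sample)
print(counts[('bread', 'food')])      # 2
print(counts[('bread', 'shopping')])  # 1
```

Using sets means a word repeated inside one body is still counted once for that line. If every occurrence should count, replace the sets with plain lists.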

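You will run into the question of how to count collocations of the same word within a sentence, e.g.,

bar bar black sheep is not a real sheep

What is the collocation count of ('bar','bar')? Is it 2 or 1? The product-based code above gives 2, because the first bar collocates with the second bar, and the second bar collocates with the first bar.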
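
If a count of 1 is the desired answer, one option is to count unordered pairs of token positions instead of ordered ones. This is a sketch of that alternative under the assumption that word order within a pair does not matter. It swaps product for itertools.combinations and is not what the answer's code does.

```python
from collections import Counter
from itertools import combinations

sentence = "bar bar black sheep is not a real sheep"
tokens = sentence.split()

# combinations() visits each pair of token positions once (i < j), and
# sorting the pair makes ('bar', 'black') and ('black', 'bar') the same key.
unordered = Counter(tuple(sorted(pair)) for pair in combinations(tokens, 2))

print(unordered[('bar', 'bar')])      # 1
print(unordered[('sheep', 'sheep')])  # 1
print(unordered[('bar', 'sheep')])    # 4
```

With this variant a token is never paired with itself, so no subtraction step is needed.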
