python - Python function to find number of unique words / total words not working... why?

Question

Why doesn't this code work?

def hapax_legomana_ratio(text):
    ''' Return the hapax_legomana ratio for this text.
    This ratio is the number of words that occur exactly once divided
    by the total number of words.
    text is a list of strings each ending in \n.
    At least one line in text contains a word.'''

    uniquewords=dict()
    words=0
    for line in text:
        line=line.strip().split()
        for word in line:
            words+=1
            if word in words:
                uniquewords[word]-=1
            else:
                uniquewords[word]=1
    HLR=len(uniquewords)/words

    print (HLR)

When I test it, it gives me the wrong answer. For example, when 3 of the 9 words are unique, it gives me 0.20454545454545456 instead of .33333.

score 2 · Accepted Answer

To find the ratio of the number of words that occur exactly once to the total number of words in the text:

from collections import Counter

def hapax_legomana_ratio(text):
    words = text.split() # a word is anything separated by a whitespace
    return sum(count == 1 for count in Counter(words).values()) / len(words)

This assumes that text is a single string. If you have a list of lines instead, you can build the words list like this:

words = [word for line in all_lines for word in line.split()]
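For the list-of-lines input described in the question, the two pieces above could be combined roughly as follows. This is only a sketch: the name hapax_ratio_from_lines and the sample lines are made up for illustration, and it relies on the question's guarantee that at least one line contains a word.

```python
from collections import Counter

def hapax_ratio_from_lines(lines):
    """Hypothetical helper: hapax legomena ratio for a list of lines."""
    # Flatten the lines into one list of whitespace-separated words.
    words = [word for line in lines for word in line.split()]
    counts = Counter(words)
    # Words whose count is exactly 1, divided by the total number of words.
    return sum(count == 1 for count in counts.values()) / len(words)

# Made-up sample: 9 words in total, of which "c", "d" and "e" occur once.
sample = ["a b c\n", "a b d\n", "a b e\n"]
ratio = hapax_ratio_from_lines(sample)
print(ratio)  # 0.3333333333333333
```

With 3 of the 9 words occurring once, this gives the .33333 the question expected.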

score 1 · Accepted Answer

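There are several mistakes in your code. I think there is a typo in the line if word in words: it should test against uniquewords (the dict) rather than words (which is the running count).

More importantly, the text you provide has to be split into lines and passed as a list of those lines. I would rather suggest doing this: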

for line in text.splitlines():

That way the text can be passed as a single string, and you don't have to worry about turning it into a list first.

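Also, using len(uniquewords) is wrong, because you store every word in the dict regardless of whether it is unique. Whether a word is unique is recorded in the value stored under that word's key: the value is 1 only for a word seen exactly once, and drops to 0 or below once the word repeats. So you should iterate over the dictionary and count the keys whose value is 1.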
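To make the difference concrete, the sketch below fills a dict the same way the question's loop does (1 on first sight, decremented on every repeat) and compares len() with a count of the entries that are still 1. The sample words are made up.

```python
words = ["the", "cat", "saw", "the", "dog", "the"]

seen = {}
for word in words:
    if word in seen:
        seen[word] -= 1   # repeated word: value drops to 0, then -1, ...
    else:
        seen[word] = 1    # first occurrence

distinct = len(seen)  # every different word, repeated or not
once_only = sum(1 for value in seen.values() if value == 1)

print(distinct, once_only)  # 4 3
```

Here len() reports 4 because "the" is still a key in the dict, while only 3 words actually occur once.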

Also, you are not handling punctuation! Suppose this is the text:

This is a test,
Yes, this is a test.

Notice how test, and test. will be stored as two different keys in the words dict?
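A quick way to see this, using a two-line English sample for illustration, is to compare the raw result of split() with the same tokens after str.strip(string.punctuation), which removes punctuation from both ends of each token. Note that this sketch leaves case alone, so "This" and "this" still differ.

```python
import string

text = "This is a test,\nYes, this is a test."

raw = text.split()
# Strip leading and trailing punctuation from each token.
cleaned = [word.strip(string.punctuation) for word in raw]

print(raw)      # ['This', 'is', 'a', 'test,', 'Yes,', 'this', 'is', 'a', 'test.']
print(cleaned)  # ['This', 'is', 'a', 'test', 'Yes', 'this', 'is', 'a', 'test']
```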

The slightly corrected code is below.

def hapax_legomana_ratio(text):
    ''' Return the hapax_legomana ratio for this text.
    This ratio is the number of words that occur exactly once divided
    by the total number of words.
    text is a list of strings each ending in \n.
    At least one line in text contains a word.'''

    uniquewords = dict()
    words = 0
    for line in text:
        line = line.strip().split()
        for word in line:
            words += 1
            # Drop punctuation at either end so "test," and "test." both count as "test".
            word = word.strip('.,;:!?')
            if word in uniquewords:
                uniquewords[word] -= 1
            else:
                uniquewords[word] = 1

    unique_count = 0
    for each in uniquewords:
        if uniquewords[each] == 1:
            unique_count += 1
    HLR = unique_count/words

    print(HLR)
    return HLR  # the docstring promises a return value

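Finally, if this is a large project and/or you will need this again in the future, I would rather suggest using the collections.Counter class to do all of this work instead of doing it by hand.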
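As a rough sketch of what that could look like for a list of lines, with punctuation handled as well: the name hapax_ratio_clean is hypothetical, and lowercasing the words is an extra assumption that the answer above does not make.

```python
import string
from collections import Counter

def hapax_ratio_clean(lines):
    """Hypothetical helper: ratio of words occurring once to all words."""
    words = []
    for line in lines:
        for token in line.split():
            # Assumption: strip punctuation at the edges and ignore case.
            word = token.strip(string.punctuation).lower()
            if word:  # skip tokens that were only punctuation
                words.append(word)
    counts = Counter(words)
    return sum(1 for n in counts.values() if n == 1) / len(words)

lines = ["This is a test,\n", "Yes, this is a test.\n"]
result = hapax_ratio_clean(lines)
print(result)  # only "yes" occurs once among the 9 words, so 1/9
```

Like the original, this assumes at least one line contains a word; an empty input would raise ZeroDivisionError.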
