python - Python: Finding the word that shows up the most?

Question

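I'm trying to get my program to report the word that shows up the most in a text file. For example, if I type in "Hello I like pie because they are like so good", the program should print out "like occurred the most". I get this error when executing Option 3: KeyError: 'h'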

#Prompt the user to enter a block of text.
done = False
textInput = ""
while(done == False):
    nextInput= input()
    if nextInput== "EOF":
        break
    else:
        textInput += nextInput

#Prompt the user to select an option from the Text Analyzer Menu.
print("Welcome to the Text Analyzer Menu! Select an option by typing a number"
    "\n1. shortest word"
    "\n2. longest word"
    "\n3. most common word"
    "\n4. left-column secret message!"
    "\n5. fifth-words secret message!"
    "\n6. word count"
    "\n7. quit")

#Set option to 0.
option = 0

#Use the 'while' to keep looping until the user types in Option 7.
while option !=7:
    option = int(input())

#The error occurs in this specific section of the code.
#If the user selects Option 3,
    elif option == 3:
        word_counter = {}
        for word in textInput:
            if word in textInput:
                word_counter[word] += 1
            else:
                word_counter[word] = 1

        print("The word that showed up the most was: ", word)

score 2 · Accepted Answer

I think you probably want to do:

for word in textInput.split():
  ...

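Currently, you are just iterating over every character in textInput. To iterate over every word, we must first split the string into a list of words. By default, .split() splits on whitespace, but you can change this by passing a delimiter to split().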
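To see the difference concretely, here is a small sketch (the sample string is made up for illustration). Iterating over a string yields single characters, which is where a key such as 'h' comes from, while split() yields whole words:

```python
sample = "hi there, hi again"  # hypothetical sample text

# Iterating over a string yields one character at a time.
chars = [ch for ch in sample]

# split() with no arguments splits on runs of whitespace.
words = sample.split()

# Passing a delimiter splits on that exact string instead.
parts = sample.split(", ")

print(chars[:2])  # ['h', 'i']
print(words)      # ['hi', 'there,', 'hi', 'again']
print(parts)      # ['hi there', 'hi again']
```

Note that split() leaves punctuation attached to the words ('there,'), so it may need stripping before counting.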

Also, you need to check whether the word is in your dictionary, not in your original string. So try:

if word in word_counter:
  ...
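The KeyError comes from word_counter[word] += 1 running for a key that has not been stored yet: the original test (word in textInput) is always true, so the else branch that creates the key never runs. A minimal sketch of the corrected loop, using a made-up sample string; dict.get with a default of 0 is shown as an optional shorthand for the same if/else:

```python
text_input = "he likes pie because he likes pie"  # hypothetical sample text

word_counter = {}
for word in text_input.split():
    # Check the dictionary, not the original string.
    if word in word_counter:
        word_counter[word] += 1
    else:
        word_counter[word] = 1

# Equivalent shorthand: get() returns 0 for a word not seen yet.
word_counter_short = {}
for word in text_input.split():
    word_counter_short[word] = word_counter_short.get(word, 0) + 1

print(word_counter == word_counter_short)  # True
```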

Then, to find the entry with the most occurrences:

highest_word = ""
highest_value = 0

for k,v in word_counter.items():
  if v > highest_value:
    highest_value = v
    highest_word = k

Then, just print out the values of highest_word and highest_value.
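As a shorter alternative to the explicit loop, the built-in max() can pick the key with the largest count. This is only a sketch with made-up counts; like the loop above, it reports a single word even when several are tied.

```python
word_counter = {"he": 2, "likes": 3, "pie": 1}  # hypothetical counts

# max() iterates over the keys and compares them by their stored counts.
highest_word = max(word_counter, key=word_counter.get)
highest_value = word_counter[highest_word]

print(highest_word, highest_value)  # likes 3
```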

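To keep track of ties, just keep a list of the highest words. If we find a higher count, clear the list and start rebuilding it. Here is the complete program so far: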

textInput = "He likes eating because he likes eating"
word_counter = {}
for word in textInput.split():
  if word in word_counter:
    word_counter[word] += 1
  else:
    word_counter[word] = 1


highest_words = []
highest_value = 0

for k,v in word_counter.items():
  # if we find a new value, create a new list,
  # add the entry and update the highest value
  if v > highest_value:
    highest_words = []
    highest_words.append(k)
    highest_value = v
  # else if the value is the same, add it
  elif v == highest_value:
    highest_words.append(k)

# print out the highest words
for word in highest_words:
  print(word)

score 2 · Accepted Answer

Rather than rolling your own counter, use Counter from the collections module.

>>> text = 'blah and stuff and things and stuff'
>>> from collections import Counter
>>> c = Counter(text.split())
>>> c.most_common()  # on Python 3.7+, ties keep first-seen order
[('and', 3), ('stuff', 2), ('blah', 1), ('things', 1)]
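If ties matter, the counts from Counter can be filtered for every word that matches the top count. A minimal sketch, with a made-up sentence and a hypothetical helper name top_words:

```python
from collections import Counter


def top_words(text):
    """Return (words, count) for every word tied for the highest count."""
    counts = Counter(text.split())
    if not counts:
        return [], 0
    # most_common(1) returns a one-element list: [(word, count)].
    top_count = counts.most_common(1)[0][1]
    return [w for w, n in counts.items() if n == top_count], top_count


words, count = top_words("he likes eating because he likes eating")
print(sorted(words), count)  # ['eating', 'he', 'likes'] 2
```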

Also, as a matter of general code style, avoid adding comments like this:

#Set option to 0.
option = 0

It makes your code less readable, not more.

score 1 · Accepted Answer

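The original answer is certainly correct, but you may want to keep in mind that it will not show you ties for first place. A sentence like this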

A life in the present is a present itself.

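will only show either "a" or "present" as the number-one hit. In fact, since dictionaries are (generally) unordered (insertion order is only guaranteed from Python 3.7 on), the result you see may not even be the first word that is repeated multiple times.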

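If you need to report multiple words, may I suggest the following:

1) Use your current key-value pair method to get 'word':'hits'.
2) Determine the greatest value of 'hits'.
3) Check which values equal the greatest number of hits, and add those keys to a list.
4) Iterate over the list to display the words with the most hits.

For example: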

greatestNumber = 0
# establish the highest number for wordCounter.values()
for hits in wordCounter.values():
    if hits > greatestNumber:
        greatestNumber = hits

topWords = []
#find the keys that are paired to that value and add them to a list
#we COULD just print them as we iterate, but I would argue that this
#makes this function do too much
for word in wordCounter.keys():
    if wordCounter[word] == greatestNumber:
        topWords.append(word)

#now reveal the results
print("The words that showed up the most, with %d hits:" % greatestNumber)
for word in topWords:
    print(word)

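Depending on whether you use Python 2.7 or Python 3, your mileage (and syntax) may vary. But ideally, IMHO, you first determine the greatest number of hits, and then go back and add the relevant entries to a new list.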
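The four steps can also be sketched as one self-contained function for Python 3. Lowercasing and stripping surrounding punctuation are assumptions added here, not part of the code above; without them "A" and "a" count as different words and the sample sentence produces no tie. The name find_top_words is hypothetical.

```python
import string


def find_top_words(text):
    """Return (top_words, greatest_number) following the four steps above."""
    # 1) Build the word -> hits mapping, ignoring case and edge punctuation.
    word_counter = {}
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:
            word_counter[word] = word_counter.get(word, 0) + 1

    # 2) Determine the greatest number of hits (0 for empty input).
    greatest_number = max(word_counter.values(), default=0)

    # 3) Collect every word whose hits equal that maximum.
    top_words = [w for w, hits in word_counter.items() if hits == greatest_number]
    return top_words, greatest_number


# 4) Display the results.
words, hits = find_top_words("A life in the present is a present itself.")
print("The words that showed up the most, with %d hits:" % hits)
for word in words:
    print(word)
```

With the sample sentence this reports "a" and "present", each with 2 hits.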

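EDIT - You should probably use Counter, as suggested in a different answer. I didn't even know this was something Python came ready to do. Haha, don't accept my answer unless you absolutely have to write your own counter! It seems there is already a module for that.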

score 0 · Accepted Answer

With Python 3.6+, you can use statistics.mode:

>>> from statistics import mode
>>> mode('Hello I like pie because they are like so good'.split())
'like'
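One caveat: before Python 3.8, mode raises StatisticsError when there is no single most common value; from 3.8 on it returns the first one encountered. If all tied words are wanted, Python 3.8+ also provides statistics.multimode. A short sketch with a made-up sentence:

```python
from statistics import mode, multimode  # multimode requires Python 3.8+

words = "he likes pie and she likes pie".split()

# "likes" and "pie" both appear twice.
first_mode = mode(words)       # first of the tied words on Python 3.8+
all_modes = multimode(words)   # every tied word, in first-seen order

print(first_mode)  # likes
print(all_modes)   # ['likes', 'pie']
```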

score -1 · Accepted Answer

I'm not too keen on Python, but on your last print statement, shouldn't you have a %s?

i.e.: print("The word that showed up the most was: %s" % word)
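For what it's worth, a sketch of how the variants behave in Python 3 (the word value is made up). Passing the word as a second argument to print() already works, and a %s placeholder is only filled in when the % operator is applied:

```python
word = "like"  # hypothetical result

# print() joins its arguments with a space, so no placeholder is needed.
print("The word that showed up the most was:", word)

# A %s placeholder is only substituted by the % operator...
with_percent = "The word that showed up the most was: %s" % word

# ...or the same message can be built with an f-string (Python 3.6+).
with_fstring = f"The word that showed up the most was: {word}"

print(with_percent == with_fstring)  # True
```

Either way, the question's loop still needs to track the most frequent word; printing word after the loop only shows the last item iterated.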
