python - Python program to find the most common words in a .txt file; must print each word and its count

Question

So far, I have a function that replaces the countChars function:

def countWords(lines):
  wordDict = {}
  for line in lines:
    wordList = lines.split()
    for word in wordList:
      if word in wordDict: wordDict[word] += 1
      else: wordDict[word] = 1
  return wordDict

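But when I run the program, it spits out this abomination (this is just a sample; there are about two pages of words, each with a huge count next to it):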

before 1478
battle-field 1478
as 1478
any 1478
altogether 1478
all 1478
ago 1478
advanced. 1478
add 1478
above 1478

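While this obviously means the code is sound enough to run, I'm not getting what I want out of it. It needs to print how many times each word occurs in the file (gb.txt, which is the Gettysburg Address). Obviously, each word in the file does not appear exactly 1478 times.

I'm very new to programming, so I'm a bit stumped.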

from __future__ import division

inputFileName = 'gb.txt'

def readfile(fname):
  f = open(fname, 'r')
  s = f.read()
  f.close()
  return s.lower()

def countChars(t):
  charDict = {}
  for char in t:
    if char in charDict: charDict[char] += 1
    else: charDict[char] = 1
  return charDict

def findMostCommon(charDict):
  mostFreq = ''
  mostFreqCount = 0
  for k in charDict:
    if charDict[k] > mostFreqCount:
      mostFreqCount = charDict[k]
      mostFreq = k
  return mostFreq

def printCounts(charDict):
  for k in charDict:
    #First, handle some chars that don't show up very well when they print
    if k == '\n': print '\\n', charDict[k]  #newline
    elif k == ' ': print 'space', charDict[k]
    elif k == '\t': print '\\t', charDict[k] #tab
    else: print k, charDict[k]  #Normal character - print it with its count

def printAlphabetically(charDict):
  keyList = charDict.keys()
  keyList.sort()
  for k in keyList:
    #First, handle some chars that don't show up very well when they print
    if k == '\n': print '\\n', charDict[k]  #newline
    elif k == ' ': print 'space', charDict[k]
    elif k == '\t': print '\\t', charDict[k] #tab
    else: print k, charDict[k]  #Normal character - print it with its count

def printByFreq(charDict):
  aList = []
  for k in charDict:
    aList.append([charDict[k], k])
  aList.sort()     #Sort into ascending order
  aList.reverse()  #Put in descending order
  for item in aList:
    #First, handle some chars that don't show up very well when they print
    if item[1] == '\n': print '\\n', item[0]  #newline
    elif item[1] == ' ': print 'space', item[0]
    elif item[1] == '\t': print '\\t', item[0] #tab
    else: print item[1], item[0]  #Normal character - print it with its count

def main():
  text = readfile(inputFileName)
  charCounts = countChars(text)
  mostCommon = findMostCommon(charCounts)
  #print mostCommon + ':', charCounts[mostCommon]
  #printCounts(charCounts)
  #printAlphabetically(charCounts)
  printByFreq(charCounts)

main()

score 25 · Accepted Answer

If you need to count the words in a passage, it is better to use a regular expression.

Let's start with a simple example:

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"

words = re.findall(r'\w+', my_string) #This finds words in the document

Result:

>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']

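Note that "Is" and "is" are treated as two different words. My guess is that you want them counted as the same word, so we can convert every word to uppercase and then count them.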

from collections import Counter

cap_words = [word.upper() for word in words] #capitalizes all the words

word_counts = Counter(cap_words) #counts the number each time a word appears

Result:

>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})

Are you good up to here?

Now we need to do the same thing as above, only this time reading from a file.

import re
from collections import Counter

with open('your_file.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word.upper() for word in words]

word_counts = Counter(cap_words)
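
To finish the task in the title, the counts can be printed most-frequent first with Counter.most_common. A minimal sketch of that last step, using a made-up inline string in place of the file so it runs on its own (the sample text and the choice of the top 3 are assumptions for illustration):

```python
import re
from collections import Counter

# Sample text standing in for the contents of the file (hypothetical).
passage = "The cat and the hat. The end."

words = re.findall(r'\w+', passage)
word_counts = Counter(word.upper() for word in words)

# most_common(n) returns (word, count) pairs, highest count first.
for word, count in word_counts.most_common(3):
    print(word, count)
```

Calling most_common() with no argument returns every word, which is probably what you want for a full report.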

score 17 · Accepted Answer

If you use the powerful tools at your disposal, this program is really a 4-liner:

import collections
import re

with open(yourfile) as f:   # yourfile: path to your text file
    text = f.read()

words = re.compile(r"[\w']+", re.U).findall(text)   # re.U == re.UNICODE
counts = collections.Counter(words)

The regular expression finds all words regardless of the punctuation adjacent to them (but counts apostrophes as part of a word).

A Counter acts almost like a dictionary, but you can also do things like counts.most_common(10), add counts together, and so on. See help(Counter).
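
As a small illustration of those two operations, here is a sketch with made-up sample strings (the variable names are only for this example):

```python
from collections import Counter

# Two word counts, e.g. from two different passages (sample data).
first = Counter("the quick brown fox the end".split())
second = Counter("the lazy dog".split())

# Adding two Counters sums the counts key by key.
combined = first + second

print(combined.most_common(1))  # [('the', 3)]
```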

I would also suggest that you not write printBy... functions, since only functions without side effects are easy to reuse.

def countsSortedAlphabetically(counter, **kw):
    return sorted(counter.items(), **kw)

#def countsSortedNumerically(counter, **kw):
#    return sorted(counter.items(), key=lambda x:x[1], **kw)
#### use counter.most_common(n) instead

# `from pprint import pprint as pp` is also useful
def printByLine(tuples):
    print( '\n'.join(' '.join(map(str,t)) for t in tuples) )

Demo:

>>> words = Counter(['test','is','a','test'])
>>> printByLine( countsSortedAlphabetically(words, reverse=True) )
test 2
is 1
a 1

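Edit to address Mateusz Konieczny's comment: replaced [a-zA-Z'] with [\w']. According to the Python docs, the character class \w "matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched." (...but apparently it doesn't match an apostrophe...) However, \w includes _ and 0-9, so if you don't want those and you aren't working with Unicode, you can use [a-zA-Z']; if you are working with Unicode, you would need a negative assertion or something similar to subtract [0-9_] from the \w character class.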
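
One way to do that subtraction is a doubly negated character class: [^\W\d_] matches anything that is a word character but not a digit or underscore. This is a sketch of one possible approach rather than the only option, and the sample string is made up:

```python
import re

# [^\W\d_] = "not a non-word character, not a digit, not an underscore",
# i.e. Unicode letters only. The apostrophe is allowed as an alternative.
letters_only = re.compile(r"(?:[^\W\d_]|')+")

# \u00e9 is a precomposed e-acute, to show that non-ASCII letters are kept.
text = "don't stop_123 caf\u00e9 42"
words = letters_only.findall(text)
print(words)
```

Digits and underscores act as separators here, so "stop_123" contributes only "stop".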

score 3 · Accepted Answer

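~~You have a simple typo: words where you want word.~~

~~Edit: You appear to have edited the source. Please use copy and paste to get it right the first time.~~

Edit 2: Apparently you're not the only one prone to typos. The real problem is that you have lines where you want line. I apologize for accusing you of editing the source.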

score 3 · Accepted Answer

words = ['red', 'green', 'black', 'pink', 'black', 'white', 'black',
         'eyes', 'white', 'black', 'orange', 'pink', 'pink', 'red', 'red',
         'white', 'orange', 'white', "black", 'pink', 'green', 'green', 'pink',
         'green', 'pink', 'white', 'orange', "orange", 'red']

from collections import Counter
counts = Counter(words)
top_four = counts.most_common(4)
print(top_four)

score 2 · Accepted Answer

Here is a possible solution, not as elegant as ninjagecko's, but still:

from collections import defaultdict

dicto = defaultdict(int)

with open('yourfile.txt') as f:
    for line in f:
        s_line = line.rstrip().split(',') #assuming ',' is the delimiter
        for ele in s_line:
            dicto[ele] += 1

# dicto contains words as keys, word counts as values
# (Python 2 syntax; on Python 3 use dicto.items() and print(k, v))
for k, v in dicto.iteritems():
    print k, v
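
The same idea as a self-contained Python 3 sketch. An in-memory io.StringIO stands in for the file so the example runs without 'yourfile.txt'; the sample data is made up, and it keeps the assumption that ',' is the delimiter:

```python
import io
from collections import defaultdict

# In-memory stand-in for the opened file (hypothetical sample data).
fake_file = io.StringIO("red,green,red\nblue,red\n")

dicto = defaultdict(int)  # missing keys start at 0
for line in fake_file:
    for ele in line.rstrip().split(','):  # assuming ',' is the delimiter
        dicto[ele] += 1

# Print each word with its count, alphabetically.
for k, v in sorted(dicto.items()):
    print(k, v)
```

For ordinary prose such as the Gettysburg Address, splitting on whitespace with line.split() is likely a better fit than splitting on commas.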

score 0 · Accepted Answer

Import Counter from collections and define the function:

from collections import Counter

def most_count(n):
  # data_set is assumed to be a string holding the text, defined elsewhere
  split_it = data_set.split()
  b = Counter(split_it)
  return b.most_common(n)

Call the function with the number of top words you want. In my case, n=15:

most_count(15)
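
Since data_set isn't defined in the snippet, here is a variant that takes the text as a parameter so it runs on its own. The extra parameter and the sample string are assumptions for illustration, not part of the answer above:

```python
from collections import Counter

def most_count(text, n):
    """Return the n most common whitespace-separated words in text."""
    return Counter(text.split()).most_common(n)

sample = "red green red blue red green"
print(most_count(sample, 2))  # [('red', 3), ('green', 2)]
```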
