python - Take certain words and print the frequency of each phrase/word?

Question

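I have a file containing a list of bands, albums, and the year each album was produced. I need to write a function that goes through this file, finds the distinct band names, and counts how many times each band appears in the file.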

The file looks like this:

Beatles - Revolver (1966)
Nirvana - Nevermind (1991)
Beatles - Sgt Pepper's Lonely Hearts Club Band (1967)
U2 - The Joshua Tree (1987)
Beatles - The Beatles (1968)
Beatles - Abbey Road (1969)
Guns N' Roses - Appetite For Destruction (1987)
Radiohead - Ok Computer (1997)
Led Zeppelin - Led Zeppelin 4 (1971)
U2 - Achtung Baby (1991)
Pink Floyd - Dark Side Of The Moon (1973)
Michael Jackson -Thriller (1982)
Rolling Stones - Exile On Main Street (1972)
Clash - London Calling (1979)
U2 - All That You Can't Leave Behind (2000)
Weezer - Pinkerton (1996)
Radiohead - The Bends (1995)
Smashing Pumpkins - Mellon Collie And The Infinite Sadness (1995)
.
.
.

The output must be sorted in descending order of frequency, like this:

band1: number1
band2: number2
band3: number3

This is my code so far:

def read_albums(filename) :

    file = open("albums.txt", "r")
    bands = {}
    for line in file :
        words = line.split()
        for word in words:
            if word in '-' :
                del(words[words.index(word):])
        string1 = ""
        for i in words :
            list1 = []

            string1 = string1 + i + " "
            list1.append(string1)
        for k in list1 :
            if (k in bands) :
                bands[k] = bands[k] +1
            else :
                bands[k] = 1


    for word in bands :
        frequency = bands[word]
        print(word + ":", len(bands))

I think there is an easier way to do this, but I'm not sure. Also, I'm not sure how to sort the dictionary by frequency. Do I need to convert it to a list?

score 2 · Accepted Answer

You're right, there is an easier way, using Counter:

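from collections import Counter

with open('bandfile.txt') as f:
    # line.strip() is empty for blank lines, so they are skipped
    counts = Counter(line.split('-')[0].strip() for line in f if line.strip())

# most_common() yields (band, count) pairs, highest count first
for band, count in counts.most_common():
    print("{0}: {1}".format(band, count))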

What exactly does this do: line.split('-')[0].strip() for line in f if line.strip()?

This expression is a compact form of the following loop:

temp_list = []
for line in f:
    if line.strip():  # this makes sure to skip blank lines
        bits = line.split('-')
        temp_list.append(bits[0].strip())  # lists use append(), not add()

counts = Counter(temp_list)

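However, unlike the loop above, it does not create an intermediate list. Instead, it creates a generator expression, a more memory-efficient way to step through the items one at a time, and that generator is passed as the argument to Counter.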
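As a minimal sketch of the generator-expression approach described above, the snippet below feeds a few sample lines from the question into Counter without reading a file. The names SAMPLE_LINES and count_bands are made up for illustration, and the filter relies on line.strip() on the assumption that blank lines read from a file still carry a trailing newline.

```python
from collections import Counter

# A few lines from the question's sample file, inlined so no file is needed.
# The blank entry imitates an empty line in the file.
SAMPLE_LINES = [
    "Beatles - Revolver (1966)\n",
    "Nirvana - Nevermind (1991)\n",
    "\n",
    "Beatles - Abbey Road (1969)\n",
    "U2 - The Joshua Tree (1987)\n",
    "Michael Jackson -Thriller (1982)\n",
    "U2 - Achtung Baby (1991)\n",
    "Beatles - The Beatles (1968)\n",
]


def count_bands(lines):
    """Count band names (hypothetical helper, not part of the answer)."""
    # The generator expression hands one band name at a time to Counter,
    # so no intermediate list is built.
    return Counter(
        line.split("-")[0].strip()  # text before the first hyphen
        for line in lines
        if line.strip()             # skip lines that are only whitespace
    )


counts = count_bands(SAMPLE_LINES)
for band, count in counts.most_common():
    print("{0}: {1}".format(band, count))
```

Because count_bands accepts any iterable of strings, an open file object can be passed in place of the list.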

score 1 · Accepted Answer

If you are looking for conciseness, use defaultdict and sorted:

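from collections import defaultdict

bands = defaultdict(int)  # missing keys start at 0
with open('tmp.txt') as f:
    for line in f:  # iterate over the file directly; xreadlines() is Python 2 only
        band = line.split('-')[0].strip()  # also handles "Michael Jackson -Thriller"
        if band:  # skip blank lines
            bands[band] += 1
for band, count in sorted(bands.items(), key=lambda t: t[1], reverse=True):
    print('%s: %d' % (band, count))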
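To make the two building blocks of this answer easier to see, here is a small sketch that assumes Python 3 and uses an inlined list of band names (invented for illustration) instead of a file. It also sorts on a two-part key so that bands with equal counts print alphabetically. That tie-breaking rule is an assumption added here, since the question only asks for descending frequency.

```python
from collections import defaultdict

# Illustrative input: band names as they might come out of the split step.
names = ["U2", "Clash", "Beatles", "U2", "Clash", "Beatles", "U2"]

bands = defaultdict(int)
for name in names:
    # A missing key is created with int(), i.e. 0, so no membership test is needed.
    bands[name] += 1

# Negating the count sorts highest first; the name breaks ties alphabetically.
ranked = sorted(bands.items(), key=lambda t: (-t[1], t[0]))

for band, count in ranked:
    print("%s: %d" % (band, count))
```

With only `key=lambda t: t[1], reverse=True`, tied bands keep the order in which they were first inserted, because sorted() is stable.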

score 0 · Accepted Answer

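My approach is to use the split() method to break each line of the file into a list of its component tokens. You can then take the band name (the first token in the list) and start adding names to a dictionary to keep track of the counts: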

import operator

def main():
  band_counts = {}

  # build a dictionary that adds each band as it is listed, then increments the count for re-lists
  with open("albums.txt", "r") as f:
    for line in f:
      line_items = line.split("-")  # break up the line into individual tokens
      band = line_items[0].strip()  # drop surrounding whitespace and the newline

      # don't want to add blank lines to the band list
      if not band:
        continue

      if band in band_counts:
        band_counts[band] += 1  # band already in the counts, increment the count
      else:
        band_counts[band] = 1   # band not yet in the counts, add it with a count of 1

  # create a list of results sorted by count, highest first
  sorted_list = sorted(band_counts.items(), key=operator.itemgetter(1), reverse=True)

  for item in sorted_list:
    print("{0}: {1}".format(item[0], item[1]))

if __name__ == "__main__":
  main()

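Notes:

I followed the advice in this answer to create the sorted results: Sort a Python dictionary by value
If you are new to Python, take a look at Google's Python class. I found it very helpful when I was starting out: https://developers.google.com/edu/python/?csw=1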
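As a rough illustration of the sort-by-value note, and of the question's "do I need to convert it to a list?", the sketch below assumes Python 3.7 or later and uses made-up counts. sorted() does not reorder the dictionary itself. It returns a new list of (band, count) tuples, which is usually all that is needed for printing.

```python
import operator

# Made-up counts, purely for illustration.
band_counts = {"Nirvana": 1, "Beatles": 4, "U2": 3, "Radiohead": 2}

# sorted() leaves band_counts untouched and returns a list of (band, count) tuples.
by_count = sorted(band_counts.items(), key=operator.itemgetter(1), reverse=True)

for band, count in by_count:
    print("{0}: {1}".format(band, count))

# Optional: dicts preserve insertion order in Python 3.7+, so the sorted
# pairs can be turned back into a dict if a mapping is needed afterwards.
ordered_counts = dict(by_count)
```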
