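I need help making a bar chart that shows the frequencies of the ten most common words in a file. Next to each bar should be a second bar whose height is the frequency predicted by Zipf's law. (For example, suppose the most common word appears 100 times. Zipf's law predicts that the second most common word should appear about 50 times (half as often as the most common), the third about 33 times (a third as often), the fourth about 25 times (a quarter as often), and so on.)
The function takes the name of a text file (as a string) as input.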
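To make the arithmetic in that example concrete, here is a minimal sketch of the prediction rule: the word of rank r is expected to occur about (count of the top word) / r times. The names `top_count` and `predicted` are placeholders chosen for illustration, not part of my code.

```python
# Count of the most common word, taken from the example above.
top_count = 100

# Zipf's law: the word of rank r is predicted to occur top_count / r times.
predicted = [top_count / rank for rank in range(1, 5)]

print([round(p) for p in predicted])  # prints [100, 50, 33, 25]
```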
import string
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt


def zipf_graph(text_file):
    with open(text_file, encoding='utf8') as file:
        text = file.read()

    # Strip punctuation, lowercase, and split into words.
    punc = string.punctuation + '’”—⎬⎪“⎫'
    new_text = text
    for char in punc:
        new_text = new_text.replace(char, '')
    new_text = new_text.lower()
    text_split = new_text.split()

    # Determines how many times each word appears in the file.
    word_and_freq = Counter(text_split)
    top_ten_words = word_and_freq.most_common(10)
    print(top_ten_words)

    # graph info
    barWidth = 0.25
    bars1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # I want the top_ten_words here
    bars2 = [10, 5, 3.33, 2.5, 2, 1.67, 1.43, 1.25, 1.11, 1]  # Zipf Law freq here, numbers are just ex.
    r1 = np.arange(len(bars1))
    r2 = [x + barWidth for x in r1]
    plt.bar(r1, bars1, color='#7f6d5f', width=barWidth, edgecolor='white', label='Word')
    plt.bar(r2, bars2, color='#2d7f5e', width=barWidth, edgecolor='white', label='Zipf Law')
    plt.xlabel('group', fontweight='bold')
    plt.xticks([r + barWidth for r in range(len(bars1))],
               ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10'])
    # Want words to print below bars
    plt.legend()
    plt.show()


zipf_graph('gatsby.txt')
The code prints the ten most common words and their frequencies in this format (as an example, I used the book The Great Gatsby):
[('the', 2573), ('and', 1594), ('a', 1451), ('of', 1233), ('to', 1209), ('i', 1178), ('in', 861), ('he', 797), ('was', 766), ('that', 596)]
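One possible way to turn that list into the two bar series is sketched below. It assumes the input has the same shape as the `most_common(10)` output shown above; the helper names `zipf_series` and `plot_zipf` are hypothetical and only meant as an illustration, not a definitive solution.

```python
def zipf_series(top_words):
    """Split (word, count) pairs into labels, observed counts, and Zipf predictions."""
    words = [word for word, _ in top_words]
    observed = [count for _, count in top_words]
    top_count = observed[0]
    # Zipf's law: the word of rank r is predicted to occur top_count / r times.
    predicted = [top_count / rank for rank in range(1, len(observed) + 1)]
    return words, observed, predicted


def plot_zipf(top_words, bar_width=0.25):
    """Draw observed and predicted counts side by side, labelled with the words."""
    import numpy as np
    import matplotlib.pyplot as plt

    words, observed, predicted = zipf_series(top_words)
    x = np.arange(len(words))
    plt.bar(x, observed, width=bar_width, color='#7f6d5f',
            edgecolor='white', label='Word')
    plt.bar(x + bar_width, predicted, width=bar_width, color='#2d7f5e',
            edgecolor='white', label='Zipf Law')
    # Center each tick between the two bars of its pair and label it with the word.
    plt.xticks(x + bar_width / 2, words)
    plt.xlabel('word', fontweight='bold')
    plt.ylabel('frequency')
    plt.legend()
    plt.show()


# The Great Gatsby counts printed above.
gatsby_top_ten = [('the', 2573), ('and', 1594), ('a', 1451), ('of', 1233),
                  ('to', 1209), ('i', 1178), ('in', 861), ('he', 797),
                  ('was', 766), ('that', 596)]
words, observed, predicted = zipf_series(gatsby_top_ten)
```

Calling `plot_zipf(gatsby_top_ten)`, or `plot_zipf(top_ten_words)` inside `zipf_graph`, would then draw the chart with the actual words under each pair of bars.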