
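I need help making a bar chart that shows the frequency of the ten most common words in a file. Next to each bar should be a second bar whose height is the frequency predicted by Zipf's law. (For example, suppose the most common word appears 100 times. Zipf's law predicts that the second most common word should appear about 50 times (half as often as the most common), the third most common word about 33 times (a third as often), the fourth most common word about 25 times (a quarter as often), and so on.)

The function takes the name of a text file (as a string) as input.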

def zipf_graph(text_file):
    import string
    with open(text_file, encoding='utf8') as file:
        text = file.read()

    punc = string.punctuation + '’”—⎬⎪“⎫'
    new_text = text.lower()
    for char in punc:
        new_text = new_text.replace(char, '')
    text_split = new_text.split()

    # Determines how many times each word appears in the file. 
    from collections import Counter
    word_and_freq = Counter(text_split)
    top_ten_words = word_and_freq.most_common(10)

    print(top_ten_words) 

    #graph info

    import numpy as np
    import matplotlib.pyplot as plt
    barWidth = 0.25
    bars1 = [1,2,3,4,5,6,7,8,9,10] # I want the top_ten_words here
    bars2 = [10,5,3.33,2.5,2,1.67,1.43,1.25,1.11,1] # Zipf Law freq here, numbers are just ex.

    r1 = np.arange(len(bars1))
    r2 = [x + barWidth for x in r1]

    plt.bar(r1, bars1, color='#7f6d5f', width=barWidth, edgecolor='white', label='Word')
    plt.bar(r2, bars2, color='#2d7f5e', width=barWidth, edgecolor='white', label='Zipf Law')
    plt.xlabel('group', fontweight='bold')
    plt.xticks([r + barWidth for r in range(len(bars1))], ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']) 
    # Want words to print below bars
    plt.legend()
    plt.show()

zipf_graph('gatsby.txt')

The code prints the top ten words and their frequencies in this format (as an example, I used the book The Great Gatsby):

[('the', 2573), ('and', 1594), ('a', 1451), ('of', 1233), ('to', 1209), ('i', 1178), ('in', 861), ('he', 797), ('was', 766), ('that', 596)]

2 Answers


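This solution worked for me. A few notes:

  • I prefer to use Pandas to assemble my dataset.
  • You need a function that returns the expected frequencies under Zipf's law. I anchored on the most frequent word, but an alternative is to anchor on the total count (of the top 10).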
import matplotlib.pyplot as plt
import pandas as pd

def zipf_frequency(most_common_count, n=10):
    # Zipf's law: the word at rank x is expected to occur most_common_count / x times.
    return [most_common_count / x for x in range(1, n + 1)]

# top_ten_words is the list of (word, count) tuples from Counter.most_common(10) in the question.
top_ten_words_df = pd.DataFrame(top_ten_words, columns=['word', 'actual count'])
top_ten_words_df['expected zipf frequency'] = zipf_frequency(top_ten_words_df.loc[0, 'actual count'])

fig, ax = plt.subplots()
top_ten_words_df.plot(kind='bar', ax=ax)
ax.set_xticklabels(top_ten_words_df['word'])  # show the words below the bars
fig.tight_layout()
plt.show()

[Image: bar chart]
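
For the alternative mentioned in the notes (anchoring on the top-10 total instead of on the most frequent word), here is a minimal sketch. It assumes "anchoring on the total" means scaling the 1/rank weights so that the predicted counts add up to the same total as the observed counts. The function name `zipf_frequency_by_total` is made up for this illustration.

```python
def zipf_frequency_by_total(counts):
    """Zipf-predicted counts scaled so they sum to sum(counts).

    Assumption: the expected count at rank k is proportional to 1/k, and the
    predictions are normalised to the observed total of the ranked words.
    """
    n = len(counts)
    total = sum(counts)
    # Sum of 1/k for k = 1..n, used to normalise the 1/k weights.
    harmonic = sum(1 / rank for rank in range(1, n + 1))
    return [total / (rank * harmonic) for rank in range(1, n + 1)]


# Counts from the Great Gatsby output shown in the question.
counts = [2573, 1594, 1451, 1233, 1209, 1178, 861, 797, 766, 596]
expected = zipf_frequency_by_total(counts)
print([round(value) for value in expected])
```

The resulting list can be assigned to the 'expected zipf frequency' column in place of the output of `zipf_frequency`. The ratios between ranks are the same either way; only the overall scale changes.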

answered 2021-03-02T20:30:18.457

Matplotlib. Here is a demo:

import matplotlib.pyplot as plt
import numpy as np

plt.rcdefaults()  # reset Matplotlib's rc settings to their defaults

objects = ('Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp')
y_pos = np.arange(len(objects))
performance = [10,8,6,4,2,1]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Usage')
plt.title('Programming language usage')

plt.show()
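
Adapted to the question's data, the same idea might look like the sketch below. It assumes the input is the list of (word, count) tuples returned by `Counter.most_common(10)`, and it anchors the Zipf prediction on the most frequent word. The helper names `zipf_series` and `plot_zipf` are invented for this example, not taken from the question.

```python
def zipf_series(top_words):
    """Split (word, count) pairs into labels, actual counts, and Zipf-predicted counts."""
    words = [word for word, _ in top_words]
    actual = [count for _, count in top_words]
    # Prediction anchored on the most frequent word: the count at rank k is top_count / k.
    expected = [actual[0] / rank for rank in range(1, len(actual) + 1)]
    return words, actual, expected


def plot_zipf(top_words, bar_width=0.4):
    """Draw actual and Zipf-predicted counts side by side, with the words as x tick labels."""
    # Imported inside the function, as in the question's code.
    import matplotlib.pyplot as plt
    import numpy as np

    words, actual, expected = zipf_series(top_words)
    x = np.arange(len(words))
    fig, ax = plt.subplots()
    # Shift each series by half a bar width so the pair is centred on its tick.
    ax.bar(x - bar_width / 2, actual, width=bar_width, label='Actual')
    ax.bar(x + bar_width / 2, expected, width=bar_width, label="Zipf's law")
    ax.set_xticks(x)
    ax.set_xticklabels(words)
    ax.set_ylabel('Count')
    ax.legend()
    fig.tight_layout()
    plt.show()
```

Calling `plot_zipf(top_ten_words)` at the end of `zipf_graph` should put each word under its pair of bars, which replaces the hard-coded `bars1`, `bars2`, and 'word1' to 'word10' labels.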
answered 2021-03-02T19:23:51.213