python - Zipf distribution: how to measure a Zipf distribution

Question

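How do I measure or find a Zipf distribution? For example, I have a corpus of English words. How do I find its Zipf distribution? I need to find the Zipf distribution and then plot a graph of it, but I am stuck at the first step, which is finding the distribution.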

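Edit: From the frequency count of each word, it is clear that the corpus obeys Zipf's law. But my aim is to plot a Zipf distribution graph, and I do not know how to compute the data for that graph.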

score 7 · Accepted Answer

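I won't pretend to understand statistics. However, based on reading the scipy site, here is a naive attempt in Python.

Building the data

First we get our data. As an example, we download the National Library of Medicine MeSH (Medical Subject Headings) ASCII file d2016.bin (28 MB).
Next, we open the file and read it into a string.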

with open('d2016.bin', 'r') as open_file:
    file_to_string = open_file.read()

Next, we locate the individual words in the file and separate them out.

import re
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

Finally, we build a dictionary with the unique words as keys and their counts as values.

frequency = {}
for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1
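The same count can also be sketched with collections.Counter from the standard library. The sample string below is made up for illustration and stands in for the contents of d2016.bin, and count_words is a hypothetical helper name, not part of the original answer.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter mapping each matched word to its number of occurrences."""
    # Same pattern as above: one letter followed by 2 to 9 lowercase letters.
    words = re.findall(r'\b[A-Za-z][a-z]{2,9}\b', text)
    return Counter(words)

# Made-up sample text standing in for the MeSH file contents.
sample = "Heart disease and heart failure; disease of the heart"
frequency = count_words(sample)
```

Note that the pattern is case-sensitive, so "Heart" and "heart" are counted separately, and two-letter words such as "of" are skipped.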

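Building the Zipf distribution data
To speed things up, we limit the data to 1000 words (the first 1000 entries of the dictionary, not the 1000 most frequent words).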

n = 1000
# items() is a view in Python 3, so wrap it in list() before slicing
frequency = dict(list(frequency.items())[:n])
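Slicing the dictionary keeps whichever entries come first, regardless of how often the words occur. If the intent is to keep the most frequent words instead, one possible sketch sorts by count first. This is an assumption about intent, not what the answer does; top_n is a hypothetical helper and the counts are made up.

```python
from collections import Counter

def top_n(frequency, n):
    """Return a dict of the n most frequent words, highest count first."""
    return dict(Counter(frequency).most_common(n))

# Made-up counts for illustration only.
frequency = {"the": 50, "heart": 12, "of": 30, "disease": 7}
top_two = top_n(frequency, 2)
```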

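After that we take the frequency values and convert them to a numpy array. The numpy.random.zipf function draws samples from a Zipf distribution; here the word counts take the place of those samples.

The distribution parameter is set to a = 2. as an example, since it must be greater than 1. For visibility, we limit the histogram to counts below 50.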

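a = 2.  # distribution parameter, must be > 1
s = np.array(list(frequency.values()))

# density=True replaces normed=True, which recent matplotlib releases removed
count, bins, ignored = plt.hist(s[s<50], 50, density=True)
x = np.arange(1., 50.)
# zeta(a) normalizes the Zipf pmf; zetac(a) is zeta(a) - 1
y = x**(-a) / special.zeta(a)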
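The curve y is meant to follow the Zipf probability mass function, p(k) = k^(-a) / zeta(a), where zeta is the Riemann zeta function. As a rough standard-library check of that normalization, the sketch below approximates zeta(a) by a partial sum plus a tail correction. The names zeta_approx and zipf_pmf are hypothetical, and scipy.special.zeta is the function to prefer in practice.

```python
import math

def zeta_approx(a, terms=1000):
    """Approximate the Riemann zeta function for a > 1.

    Uses the partial sum of k**-a plus an Euler-Maclaurin tail correction.
    """
    if a <= 1:
        raise ValueError("a must be greater than 1")
    partial = sum(k ** -a for k in range(1, terms + 1))
    tail = terms ** (1 - a) / (a - 1) - 0.5 * terms ** -a
    return partial + tail

def zipf_pmf(k, a):
    """Probability of the value k under a Zipf distribution with parameter a."""
    return k ** -a / zeta_approx(a)

a = 2.0
probabilities = [zipf_pmf(k, a) for k in range(1, 50)]
```

For a = 2 the exact value of zeta(a) is pi^2 / 6, so the first probability should be close to 6 / pi^2.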

Finally, we plot the data.

Putting it all together

import re
from operator import itemgetter
import matplotlib.pyplot as plt
from scipy import special
import numpy as np

# Get our corpus of medical words
frequency = {}
with open('d2016.bin', 'r') as open_file:
    file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

#build dict of words based on frequency
for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1

# limit words to 1000 (items() is a view in Python 3, so wrap it in list() before slicing)
n = 1000
frequency = dict(list(frequency.items())[:n])

# convert the frequency values to a numpy array
s = np.array(list(frequency.values()))

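# Calculate zipf and plot the data
a = 2.  # distribution parameter, must be > 1
# density=True replaces normed=True, which recent matplotlib releases removed
count, bins, ignored = plt.hist(s[s<50], 50, density=True)
x = np.arange(1., 50.)
# zeta(a) normalizes the Zipf pmf; zetac(a) is zeta(a) - 1
y = x**(-a) / special.zeta(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()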

Plot
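The script above draws a histogram of the word counts. A complementary view, sketched here under stated assumptions, is the rank-frequency relation: sort the counts in descending order, pair each with its rank, and look at the slope on log-log axes, which Zipf's law in its classic form predicts to be near -1. The helper names below are hypothetical, the counts are made up, and a least-squares fit on log-log data is only a rough estimate of the exponent.

```python
import math

def rank_frequency(frequency):
    """Return (rank, count) pairs, rank 1 being the most frequent word."""
    counts = sorted(frequency.values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def plot_rank_frequency(pairs):
    """Draw the pairs on log-log axes (requires matplotlib)."""
    import matplotlib.pyplot as plt
    ranks, counts = zip(*pairs)
    plt.loglog(ranks, counts, marker='.', linestyle='none')
    plt.xlabel('rank')
    plt.ylabel('count')
    plt.show()

# Made-up counts that follow count = 1200 / rank exactly.
frequency = {"w%d" % r: 1200 // r for r in (1, 2, 3, 4, 5, 6)}
pairs = rank_frequency(frequency)
slope = loglog_slope(pairs)
```

With real data, passing pairs to plot_rank_frequency should give a roughly straight line if the corpus follows Zipf's law.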
