python - How do I properly fit data to a power law in Python?

Question

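I am looking at the number of occurrences of unique words in the novel Moby Dick, and using the powerlaw Python package to fit the word frequencies to a power law.

I am not sure why I cannot reproduce the results from the earlier work by Clauset et al., since both the p-value and the KS score come out "bad".

The idea is to fit the frequencies of unique words to a power law. However, the Kolmogorov-Smirnov goodness-of-fit test computed by scipy.stats.kstest looks terrible.

I have the following function to fit data to a power law: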

import numpy as np
import powerlaw
import scipy
from scipy import stats

def fit_x(x):
    fit = powerlaw.Fit(x, discrete=True)
    alpha = fit.power_law.alpha
    xmin  = fit.power_law.xmin
    print('powerlaw', scipy.stats.kstest(x, "powerlaw", args=(alpha, xmin), N=len(x)))
    print('lognorm', scipy.stats.kstest(x, "lognorm", args=(np.mean(x), np.std(x)), N=len(x)))

Download the frequencies of unique words in Herman Melville's novel Moby Dick (which, according to Aaron Clauset et al., should follow a power law):

wget http://tuvalu.santafe.edu/~aaronc/powerlaws/data/words.txt

Python script:

x =  np.loadtxt('./words.txt')
fit_x(x)

Result:

('powerlaw', KstestResult(statistic=0.862264651286131, pvalue=0.0))
('lognorm', KstestResult(statistic=0.9910368602492707, pvalue=0.0))

When I compare against the expected results by following this R tutorial on the same Moby Dick dataset, I get a decent p-value and KS test value:

library("poweRlaw")
data("moby", package="poweRlaw")
m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)
bs_p = bootstrap_p(m_pl)
bs_p$p
## [1] 0.6738

What am I missing when computing the KS test value and post-processing the fit from the powerlaw Python library? The PDF and CDF look fine to me, but the KS test looks off.

score 1 · Accepted Answer

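I think you should pay attention to whether the data is continuous or discrete, and then choose an appropriate test. Also, as mentioned earlier, the size of the data will have some effect on the result. Hope this helps.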
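To make the discrete-versus-continuous point concrete, here is a minimal sketch of a KS distance computed against a discrete power-law CDF, restricted to the tail x >= xmin as in the approach of Clauset et al. The helper names `discrete_powerlaw_cdf` and `ks_distance_tail` are made up for this illustration and are not part of powerlaw or SciPy; the data is synthetic (drawn with NumPy's zipf sampler), not the Moby Dick counts.

```python
import numpy as np
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function


def discrete_powerlaw_cdf(x, alpha, xmin):
    """P(X <= x) for a discrete power law p(k) ~ k**-alpha on k >= xmin."""
    return 1.0 - zeta(alpha, x + 1.0) / zeta(alpha, xmin)


def ks_distance_tail(data, alpha, xmin):
    """Largest gap between empirical and model CDF over the tail x >= xmin."""
    data = np.asarray(data)
    tail = data[data >= xmin]
    # For discrete data it is enough to compare the CDFs at the observed values.
    values, counts = np.unique(tail, return_counts=True)
    empirical = np.cumsum(counts) / tail.size
    model = discrete_powerlaw_cdf(values, alpha, xmin)
    return float(np.max(np.abs(empirical - model)))


# Synthetic check: numpy's zipf sampler draws from p(k) ~ k**-a on k >= 1.
rng = np.random.default_rng(0)
sample = rng.zipf(2.0, size=5000)

d_good = ks_distance_tail(sample, alpha=2.0, xmin=1)  # correct exponent
d_bad = ks_distance_tail(sample, alpha=3.5, xmin=1)   # deliberately wrong exponent
print(d_good, d_bad)
```

The distance should be small for the correct exponent and large for the wrong one. Note that this only gives the KS distance; a p-value for it would still need something like the bootstrap used by the R code in the question, because the parameters are estimated from the same data.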

score 0 · Accepted Answer

It is still not clear to me how to determine significance and goodness of fit by using scipy.stats.kstest together with the powerlaw library.
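Part of the difficulty may be that the name "powerlaw" in scipy.stats refers to a different distribution than the one fitted by the powerlaw package: SciPy's version has density a * x**(a - 1) on the interval [0, 1], and extra arguments passed through kstest are read as shape, loc, scale. The sketch below uses illustrative parameter values and a made-up toy sample (neither comes from the question's fit) to show what that implies.

```python
import numpy as np
from scipy import stats

# Illustrative values only; these are not the fitted parameters from the question.
alpha, xmin = 2.0, 7.0

# kstest(x, "powerlaw", args=(alpha, xmin)) evaluates stats.powerlaw.cdf(x, alpha, xmin),
# i.e. shape a=alpha and loc=xmin, so the model only has mass on [xmin, xmin + 1].
model = stats.powerlaw(alpha, loc=xmin)
cdf_values = model.cdf([xmin - 1.0, xmin + 0.5, xmin + 1.0, 1000.0])
print(cdf_values)

# A small made-up sample with most values below xmin, as word counts tend to be.
toy = np.array([1, 1, 1, 2, 2, 3, 5, 8, 20, 100], dtype=float)
result = stats.kstest(toy, model.cdf)
print(result.statistic)
```

Because the model CDF is 0 everywhere below xmin and 1 everywhere above xmin + 1, almost any heavy-tailed sample ends up far from it, which would be consistent with the very large KS statistics reported in the question.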

That said, powerlaw implements its own distribution_compare method, which returns the likelihood ratio R and the p-value of R (see some material from Aaron Clauset):

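R : float. Log-likelihood ratio of the two distributions' fit to the data. If greater than 0, the first distribution is preferred. If less than 0, the second distribution is preferred.

p : float. Significance of R.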

from urllib.request import urlretrieve  # Python 3 location of urlretrieve

from numpy import genfromtxt
import powerlaw

# Download the Moby Dick word frequencies shipped with the powerlaw manuscript.
urlretrieve('https://raw.github.com/jeffalstott/powerlaw/master/manuscript/words.txt', 'words.txt')
words = genfromtxt('words.txt')

fit = powerlaw.Fit(words, discrete=True)

# Each call returns (R, p); the printed results are shown as comments.
print(fit.distribution_compare('power_law', 'exponential', normalized_ratio=True))
# (9.135914718776998, 6.485614241379581e-20)
print(fit.distribution_compare('power_law', 'truncated_power_law'))
# (-0.917123083373983, 0.1756268316869548)
print(fit.distribution_compare('power_law', 'lognormal'))
# (0.008785246720842022, 0.9492243713193919)
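To illustrate what R and p mean, here is a rough sketch of a likelihood-ratio comparison written from the description in Clauset et al., using plain NumPy. It is not the library's code: the function `loglikelihood_ratio` below is a hypothetical helper, the data is a synthetic continuous power-law sample rather than the word counts, and the two candidate fits use the standard closed-form maximum-likelihood estimates for the continuous case.

```python
import math
import numpy as np


def loglikelihood_ratio(loglik_1, loglik_2):
    """Return (R, normalized R, p) from per-observation log-likelihoods."""
    diff = np.asarray(loglik_1) - np.asarray(loglik_2)
    n = diff.size
    R = float(diff.sum())              # > 0 favours the first distribution
    sigma = float(diff.std())          # spread of the pointwise differences
    normalized = R / (sigma * math.sqrt(n))
    # Two-sided significance of R under a normal approximation.
    p = math.erfc(abs(normalized) / math.sqrt(2.0))
    return R, normalized, p


# Synthetic continuous power-law sample with xmin = 1 (inverse-transform sampling).
rng = np.random.default_rng(42)
true_alpha, xmin, n = 2.5, 1.0, 2000
x = xmin * (1.0 - rng.random(n)) ** (-1.0 / (true_alpha - 1.0))

# Maximum-likelihood fits of both candidates on x >= xmin.
alpha_hat = 1.0 + n / np.sum(np.log(x / xmin))
lam_hat = 1.0 / (np.mean(x) - xmin)

ll_power_law = np.log((alpha_hat - 1.0) / xmin) - alpha_hat * np.log(x / xmin)
ll_exponential = np.log(lam_hat) - lam_hat * (x - xmin)

R, normalized_R, p = loglikelihood_ratio(ll_power_law, ll_exponential)
print(R, normalized_R, p)
```

A positive R with a small p says the power law is a significantly better description than the exponential. An R near zero with a large p, as in the lognormal comparison above, says the data cannot distinguish the two candidates.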
