python - python scipy stats pareto fit: how does it work

Question

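...The help and the online documentation say that the function scipy.stats.pareto.fit takes as arguments the dataset to be fitted and, optionally, b (the exponent), loc, and scale. The result is a triple (exponent, loc, scale).

Generating data from the same distribution should lead the fit to recover the parameters that were used to generate the data, e.g. (using the Python 3 console)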

$  python
Python 3.3.0 (default, Dec 12 2012, 07:43:02) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

(The Python console prompt ">>>" is omitted in the code lines below.)

import scipy.stats
dataset=scipy.stats.pareto.rvs(1.5,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)

However, this results in

(1.0, nan, 0.0)

(exponent 1, but it should be 1.5), and

dataset=scipy.stats.pareto.rvs(1.1,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)

results in

(1.0, nan, 0.0)

(exponent 1, but it should be 1.1), and

dataset=scipy.stats.pareto.rvs(4,loc=2.0,scale=0.4,size=10000)    #generating data
scipy.stats.pareto.fit(dataset)

(where the exponent should be 4, loc should be 2, and scale should be 0.4) results in

(1.0, nan, 0.0)

and so on. Supplying a different exponent when calling the fit function

scipy.stats.pareto.fit(dataset,1.4)

always returns exactly that exponent

(1.3999999999999999, nan, 0.0)

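The obvious question is: am I completely misunderstanding the purpose of this fit function, is it meant to be used differently, or is it simply broken?

A remark: before anyone mentions that the dedicated functions given on Aaron Clauset's web page (http://tuvalu.santafe.edu/~aaronc/powerlaws/) are more reliable than the scipy.stats methods and should be used instead: that may be true, but they are also very, very, very, very time-consuming, and for a dataset of 10000 points they take many hours (maybe days, weeks, years) on an ordinary PC.

Edit: oh, the parameter of the fit function is not the exponent of the distribution but the exponent minus 1 (but this does not change the problem above).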
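The relation mentioned in the edit can be checked numerically. The following is a minimal sketch, assuming the standardized form (loc=0, scale=1) in which scipy's pareto density is b / x**(b + 1); the variable names are only illustrative. The shape parameter b is therefore the power-law exponent minus 1.

```python
import numpy as np
from scipy import stats

b = 1.5                                  # scipy's shape parameter
x = np.array([1.5, 2.0, 5.0, 10.0])      # points inside the support x >= 1

pdf_scipy = stats.pareto.pdf(x, b)
pdf_manual = b / x ** (b + 1)            # power law with exponent b + 1

exponents_match = np.allclose(pdf_scipy, pdf_manual)
print(exponents_match)
```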

score 5 · Accepted Answer

It looks like you must supply a guess for loc and scale:

In [78]: import scipy.stats as stats

In [79]: b, loc, scale = 1.5, 0, 1

In [80]: data = stats.pareto.rvs(b, size=10000)

In [81]: stats.pareto.fit(data, 1, loc=0, scale=1)
Out[81]: (1.5237427002368424, -2.8457847787917788e-05, 1.0000329980475393)

And the guesses have to be fairly accurate for the fit to succeed:

In [82]: stats.pareto.fit(data, 1, loc=0, scale=1.01)
Out[82]: (1.5254113096223709, -0.0015898489208676779, 1.0015943893384001)

In [83]: stats.pareto.fit(data, 1, loc=0, scale=1.05)
Out[83]: (1.5234726749064218, 0.00025804526532994751, 0.99974649559141171)

In [84]: stats.pareto.fit(data, 1, loc=0.05, scale=1.05)
Out[84]: (1.0, 0.050000000000000003, 1.05)

Hopefully the context of the problem will tell you what appropriate guesses for loc and scale should be. Most likely, loc=0 and scale=1.
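If the context really does say that loc is 0, one alternative to guessing it is to hold it fixed during the fit. The sketch below assumes that fixing the location is acceptable for your data; it uses the floc keyword that the fit method of scipy.stats continuous distributions accepts, so only b and scale are estimated. The seed and sample size are arbitrary.

```python
from scipy import stats

# Sample from a Pareto distribution with b=1.5, loc=0, scale=1.
data = stats.pareto.rvs(1.5, size=10000, random_state=0)

# Hold loc at 0 (assumption: the location is known), estimate b and scale.
b_hat, loc_hat, scale_hat = stats.pareto.fit(data, floc=0)
print(b_hat, loc_hat, scale_hat)
```

The exact numbers depend on the scipy version and on the sample, so no output is shown here.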

score 5 · Accepted Answer

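The fit method is a very general and simple method that runs optimize.fmin on the negative log-likelihood function (self.nnlf) of the distribution. For distributions such as pareto, whose parameters can create undefined regions, the general method doesn't work.

In particular, the general nnlf method returns "inf" when a value of the random variable falls outside the distribution's domain of validity. The "fmin" optimizer doesn't cope well with this objective function unless you have guessed starting values very close to the final fit.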
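To make the "inf" behaviour concrete, here is a hand-written negative log-likelihood for the Pareto distribution. It is a sketch for illustration, not scipy's actual nnlf code, and the function name pareto_nll is hypothetical. It returns inf as soon as any data point falls below loc + scale, which is the situation an unconstrained simplex search cannot climb out of.

```python
import numpy as np
from scipy import stats

def pareto_nll(params, x):
    """Negative log-likelihood of a Pareto sample (illustrative, not scipy's code)."""
    b, loc, scale = params
    if b <= 0 or scale <= 0:
        return np.inf
    z = (x - loc) / scale
    if np.any(z < 1):
        return np.inf  # at least one point lies outside the support
    # log pdf = log(b) - log(scale) - (b + 1) * log(z)
    return -np.sum(np.log(b) - np.log(scale) - (b + 1) * np.log(z))

data = stats.pareto.rvs(1.5, size=10000, random_state=0)

nll_good = pareto_nll((1.0, 0.0, 1.0), data)    # start inside the valid region
nll_bad = pareto_nll((1.0, 0.05, 1.05), data)   # loc + scale exceeds min(data)
print(nll_good, nll_bad)
```

With the second starting point the objective is already infinite, which is consistent with a fit that simply hands back its starting values.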

In general, the .fit method needs to use a constrained optimizer for distributions where there are limits on the domain of applicability of the pdf.
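For the special case loc=0 the constrained problem can be solved by hand, which shows what a constraint-aware fit has to respect. This is a sketch under that assumption, with a hypothetical helper name fit_pareto_loc0: the likelihood increases with scale until the constraint scale <= min(x) becomes active, so the estimate is scale = min(x), and b then follows in closed form.

```python
import numpy as np
from scipy import stats

def fit_pareto_loc0(x):
    """Maximum-likelihood estimate of (b, scale) assuming loc = 0."""
    x = np.asarray(x, dtype=float)
    scale_hat = x.min()                               # constraint scale <= min(x) is active
    b_hat = x.size / np.sum(np.log(x / scale_hat))    # closed-form estimate of the shape
    return b_hat, scale_hat

data = stats.pareto.rvs(1.5, size=10000, random_state=0)
b_hat, scale_hat = fit_pareto_loc0(data)
print(b_hat, scale_hat)
```

This does not cover data with an unknown, nonzero loc, such as the third example in the question; that case still needs a numerical optimizer that keeps loc + scale at or below the smallest observation.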
