python - Finding a suitable cutoff value

Question

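I am trying to implement the Hampel tanh estimator to normalize highly asymmetric data. To do this, I need to perform the following calculation:

Given x, a sorted list of numbers, and m, the median of x, I need to find a such that approximately 70% of the values in x fall into the range (m-a; m+a). We know nothing about the distribution of the values in x. I am writing in Python using numpy, and my best idea so far is to write some sort of stochastic iterative search (for example, as described by Solis and Wets), but I suspect there is a better way, either as a better algorithm or as a ready-made function. I searched the numpy and scipy documentation but could not find any useful hint.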

Edit

Seth suggested using scipy.stats.mstats.trimboth, but in my test on a skewed distribution this suggestion did not work:

from scipy.stats.mstats import trimboth
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5

#check how many elements fall into the range
sel = (theList > (theMedian - a)) * (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

The output is 0.79 (~80%, instead of 70)

score 2 · Accepted Answer

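You first need to symmetrize your distribution by folding all values less than the median over to the right. Then you can use the standard scipy.stats functions on this one-sided distribution: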

from scipy.stats import scoreatpercentile
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

oneSidedList = theList.copy()           # copy original list (theList[:] would only be a view)
# fold over to the right all values left of the median
oneSidedList[theList < theMedian] = 2*theMedian - theList[theList < theMedian]

# find the 70th centile of the one-sided distribution
a = scoreatpercentile(oneSidedList, 70) - theMedian

#check how many elements fall into the range
sel = (theList > (theMedian - a)) * (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

This gives the desired result of 0.7.
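
The same fold can be sketched more compactly: folding around the median and then subtracting the median amounts to taking the absolute deviations |x - m|, so a is roughly the 70th percentile of those deviations. The helper name half_width and its fraction parameter are hypothetical, introduced only for illustration, and np.percentile interpolates linearly by default, so the coverage is only approximate for small samples or samples with ties.

```python
import numpy as np

def half_width(x, fraction=0.7):
    """Hypothetical helper: return (m, a) so that roughly `fraction`
    of the values in x fall inside (m - a, m + a)."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    # Absolute deviations from the median form the folded, one-sided sample.
    deviations = np.abs(x - m)
    a = np.percentile(deviations, 100.0 * fraction)
    return m, a

# Same test data as in the question.
the_list = np.log10(1 + np.arange(.1, 100))
m, a = half_width(the_list)

# Fraction of elements that fall strictly inside (m - a, m + a).
coverage = np.mean(np.abs(the_list - m) < a)
print(coverage)
```

Because nothing is written back into the input array, this version also sidesteps the question of whether a slice is a copy or a view.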

score 1 · Accepted Answer

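Restating the problem slightly: you know the length of the list and the fraction of the numbers in the list to consider. Given that, you can determine the difference between the first and last indices in the list that give you the desired range. The goal is then to find the indices that minimize a cost function corresponding to the desired symmetry of the values about the median.

Let the smaller index be n1 and the larger index be n2; these are not independent. The values from the list at these indices are x[n1] = m-b and x[n2] = m+c. You now want to choose n1 (and thus n2) so that b and c are as close as possible. This occurs when (b - c)**2 is minimal, which is easy to find with numpy.argmin. Paralleling the example in the question, here is an interactive session illustrating the approach: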

$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> theList = np.log10(1+np.arange(.1, 100))
>>> theMedian = np.median(theList)
>>> listHead = theList[0:30]
>>> listTail = theList[-30:]
>>> b = np.abs(listHead - theMedian)
>>> c = np.abs(listTail - theMedian)
>>> squaredDiff = (b - c) ** 2
>>> np.argmin(squaredDiff)
25
>>> listHead[25] - theMedian, listTail[25] - theMedian
(-0.2874888056626983, 0.27859407466756614)
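
The session above hard-codes 30 for a list of 100 elements and a 70% range. A minimal sketch that derives the index offset from the list length might look like the following; the function name symmetric_window and the fraction parameter are assumptions made for illustration, and the pairing of n1 with n2 = n1 + offset mirrors the listHead/listTail pairing in the session.

```python
import numpy as np

def symmetric_window(x, fraction=0.7):
    """Hypothetical helper: pick indices (n1, n2) a fixed offset apart
    whose values lie as symmetrically as possible about the median."""
    x = np.sort(np.asarray(x, dtype=float))
    m = np.median(x)
    n = len(x)
    offset = int(round(fraction * n))   # n2 - n1, fixed by the fraction
    starts = np.arange(n - offset)      # candidate values of n1
    b = np.abs(x[starts] - m)           # distance of the lower edge from m
    c = np.abs(x[starts + offset] - m)  # distance of the upper edge from m
    n1 = int(np.argmin((b - c) ** 2))   # most symmetric pair of edges
    return n1, n1 + offset

the_list = np.log10(1 + np.arange(.1, 100))
n1, n2 = symmetric_window(the_list)
the_median = np.median(the_list)
print(n1, n2)
print(the_list[n1] - the_median, the_list[n2] - the_median)
```

For the example data this selects the same pair of edges as the session (index 25 in the head, which pairs with index 95 in the full list).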

score 0 · Accepted Answer

What you want is scipy.stats.mstats.trimboth. Set proportiontocut=0.15. After trimming, take (max-min)/2.
