Based on https://stackoverflow.com/a/48981834/1840471, this is an implementation of the weighted Gini coefficient in Python:
import numpy as np

def gini(x, weights=None):
    if weights is None:
        weights = np.ones_like(x)
    # Calculate mean absolute deviation in two steps, for weights.
    count = np.multiply.outer(weights, weights)
    mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
    rmad = mad / np.average(x, weights=weights)
    # Gini equals half the relative mean absolute deviation.
    return 0.5 * rmad
This is clean and works well for medium-sized arrays, but as warned in its initial suggestion (https://stackoverflow.com/a/39513799/1840471), it's O(n²). On my computer, that means it breaks after ~20k rows:
n = 20000 # Works, 30000 fails.
gini(np.random.rand(n), np.random.rand(n))
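As a rough back-of-the-envelope check on why this fails, the sketch below estimates the size of a single dense n-by-n array, assuming float64 elements. The helper `pairwise_bytes` is hypothetical (it is not from either linked answer), and `gini` allocates several such intermediates (`count`, the outer difference, their product), so actual peak memory is likely a multiple of these figures.

```python
import numpy as np

def pairwise_bytes(n, dtype=np.float64):
    """Bytes needed for one dense n-by-n array of the given dtype."""
    return n * n * np.dtype(dtype).itemsize

for n in (20_000, 30_000, 150_000):
    # Assumption: float64, i.e. 8 bytes per element.
    print(f"n={n}: {pairwise_bytes(n) / 1e9:.1f} GB per n-by-n array")
# n=20000: 3.2 GB per n-by-n array
# n=30000: 7.2 GB per n-by-n array
# n=150000: 180.0 GB per n-by-n array
```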
Can this be adapted to work for larger datasets? Mine is ~150k rows.
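One possible direction, sketched here under the assumption that all weights are non-negative: once the values are sorted, the pairwise sum of w_i * w_j * |x_i - x_j| can be rewritten in terms of cumulative weights, which needs only O(n log n) time and O(n) memory. The name `gini_sorted` is hypothetical and does not come from either linked answer. Like `gini`, it divides by the weighted total of `x`, so that total is assumed to be nonzero.

```python
import numpy as np

def gini_sorted(x, weights=None):
    """Weighted Gini via sorting; assumes non-negative weights."""
    x = np.asarray(x, dtype=float)
    if weights is None:
        weights = np.ones_like(x)
    else:
        weights = np.asarray(weights, dtype=float)
    # Sort values (and their weights) in ascending order of x.
    order = np.argsort(x)
    x, w = x[order], weights[order]
    cumw = np.cumsum(w)      # inclusive cumulative weight
    total_w = cumw[-1]
    wx = w * x
    # For sorted x, sum_{i,j} w_i w_j |x_i - x_j|
    #   = 2 * sum_i w_i x_i (2 * cumw_i - w_i - total_w).
    # Dividing by 2 * total_w**2 * weighted_mean(x) gives the expression below.
    return np.sum(wx * (2.0 * cumw - w - total_w)) / (total_w * wx.sum())
```

On small inputs this should agree with `gini` up to floating-point error, for example `gini_sorted([1, 3])` and `gini(np.array([1, 3]))` both give 0.25, so comparing the two on a few thousand random rows is a reasonable sanity check before running it on ~150k rows.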