python - Performance of mean difference (Gini): numpy vs. for loop

Question

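My goal is to find a fast way to compute the mean difference (Gini) for qualitative data. Because some arrays may contain millions of values, I am looking for the fastest implementation. While programming, I noticed that a for loop is much faster than implementations that use only numpy functions. This is the complete opposite of what I have learned about Python and numpy so far: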

import numpy as np

def gini(array: np.ndarray) -> float:
    """Calculate the Gini index (MNDif) of an array of category counts,
    e.g. the value_counts of a pd.Series.
    https://en.wikipedia.org/wiki/Qualitative_variation#MNDif"""

    if len(array) == 1:
        gini_index = 0.0

    elif len(array) > 1:
        n = np.sum(array)
        k = len(array)
        #summation = np.sum(np.triu(np.abs(array[:, None] - array))) # version 1
        #summation = np.sum(np.abs(array[:, None] - array) / 2) # version 2

        # version3
        summation = 0
        for i in range(k-1):
            summation += np.sum(np.abs(array[i]-array[i+1:]))

        gini_index = 1 - (1/(n*(k-1))) * summation

    else:
        gini_index = np.nan

    return gini_index

a = np.ones(10000)
%timeit gini(a)

Result for version 1: 2.01 s ± 61.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Result for version 2: 1.96 s ± 45.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Result for version 3: 166 ms ± 5.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why is the version with the for loop by far the fastest?
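One possible factor, offered as a sketch rather than a measured diagnosis: versions 1 and 2 build a full k × k temporary through broadcasting (`array[:, None] - array`), which for 10000 float64 values is about 800 MB per intermediate array, while the loop only ever holds a temporary of at most k - 1 elements. The snippet below computes that size and also shows an alternative that avoids pairwise differences altogether by sorting. It relies on the identity that, for values sorted in ascending order, the sum of |x_i - x_j| over all pairs i < j equals the sum of (2i - k + 1) * x_i with 0-based i. The helper names `pairwise_abs_diff_sum_sorted` and `pairwise_abs_diff_sum_loop` are hypothetical and not part of the code above.

```python
import numpy as np

def pairwise_abs_diff_sum_sorted(array: np.ndarray) -> float:
    """Sum of |x_i - x_j| over all pairs i < j, via sorting (hypothetical helper)."""
    x = np.sort(np.asarray(array, dtype=float))
    k = x.size
    # In ascending order, x[i] is added i times (as the larger element)
    # and subtracted k - 1 - i times (as the smaller element).
    weights = 2 * np.arange(k) - k + 1
    return float(np.dot(weights, x))

def pairwise_abs_diff_sum_loop(array: np.ndarray) -> float:
    """Same sum, computed like version 3 in the question (for comparison)."""
    k = len(array)
    total = 0.0
    for i in range(k - 1):
        total += np.sum(np.abs(array[i] - array[i + 1:]))
    return float(total)

# Size in bytes of one k x k float64 temporary, as created by array[:, None] - array
k = 10_000
temp_bytes = k * k * np.dtype(np.float64).itemsize

# Small consistency check on random integer-valued counts
rng = np.random.default_rng(0)
counts = rng.integers(0, 100, size=200).astype(float)
same = np.isclose(pairwise_abs_diff_sum_sorted(counts),
                  pairwise_abs_diff_sum_loop(counts))
```

If this holds up for the real data, `summation` in `gini` could be replaced by the sorted variant, which needs O(k log k) time and O(k) memory instead of O(k²) for both. Timings would still need to be measured with `%timeit` on the actual arrays.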
