python - Calculating the Gini coefficient in Python using Numpy

Question

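First of all, I'm a newbie who has just started learning Python, and I'm trying to write some code to calculate the Gini index for a fake country. I've come up with the following: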

GDP = (653200000000)
A = (0.49 * GDP) / 100 # Poorest 10%
B = (0.59 * GDP) / 100
C = (0.69 * GDP) / 100
D = (0.79 * GDP) / 100
E = (1.89 * GDP) / 100
F = (2.55 * GDP) / 100
G = (5.0 * GDP) / 100
H = (10.0 * GDP) / 100
I = (18.0 * GDP) / 100
J = (60.0 * GDP) / 100 # Richest 10%

# Divide into quintiles and total income within each quintile
Q1 = float(A + B) # lowest quintile
Q2 = float(C + D) # second quintile
Q3 = float(E + F) # third quintile
Q4 = float(G + H) # fourth quintile
Q5 = float(I + J) # fifth quintile

# Calculate the percent of total income in each quintile
T1 = float((100 * Q1) / GDP) / 100
T2 = float((100 * Q2) / GDP) / 100
T3 = float((100 * Q3) / GDP) / 100
T4 = float((100 * Q4) / GDP) / 100
T5 = float((100 * Q5) / GDP) / 100

TR = float(T1 + T2 + T3 + T4 + T5)

# Calculate the cumulative percentage of household income
H1 = float(T1)
H2 = float(T1+T2)
H3 = float(T1+T2+T3)
H4 = float(T1+T2+T3+T4)
H5 = float(T1+T2+T3+T4+T5)

# Magic! Using numpy to calculate area under Lorenz curve.
# Problem might be here?
import numpy as np 
from numpy import trapz

# The y values. Cumulative percentage of incomes
y = np.array([Q1,Q2,Q3,Q4,Q5])

# Compute the area using the composite trapezoidal rule.
area_lorenz = trapz(y, dx=5)

# Calculate the area below the perfect equality line.
area_perfect = (Q5 * H5) / 2

# Seems to work fine until here. 
# Manually calculated Gini using the values given for the areas above 
# turns out at .58 which seems reasonable?

Gini = area_perfect - area_lorenz

# Prints utter nonsense.
print(Gini)

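Gini = area_perfect - area_lorenz just gives a result that makes no sense. I took the values given by the area variables and did the math by hand, and that came out reasonably well, but when I let the program do it, it gives me a completely ??? value (-1.7198...). What am I missing? Can someone point me in the right direction?

Thanks!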

score 4 · Accepted Answer

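The first issue is that the equation for the Gini coefficient is not applied correctly:

gini = (area between the Lorenz curve and perfect equality) / (area under perfect equality)

The denominator was not included in the calculation, and an incorrect equation was used for the area under the line of equality (see the code for the method using np.linspace and np.trapz).

The second issue is that the first point of the Lorenz curve is missing (it needs to start at 0, not at the share of the first quintile). Although the area under the Lorenz curve between 0 and the first quintile is small, its ratio to the area under the equality line, once that line is extended, is quite large.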
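To make the effect of the missing first point concrete, here is a minimal sketch that computes the normalized Gini ratio twice, once with and once without the leading 0 on the Lorenz curve. The helper names trapezoid_area and gini_from_lorenz are hypothetical (they are not from the original post), and the trapezoidal rule is written out by hand rather than calling np.trapz, which is just a choice made for this illustration.

```python
import numpy as np


def trapezoid_area(y, dx):
    """Composite trapezoidal rule for equally spaced samples (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    return dx * (y[:-1] + y[1:]).sum() / 2.0


def gini_from_lorenz(lorenz):
    """(area under equality - area under Lorenz) / (area under equality)."""
    equality = np.linspace(0.0, 1.0, len(lorenz))
    dx = 1.0 / (len(lorenz) - 1)  # n points span n - 1 intervals on [0, 1]
    area_pe = trapezoid_area(equality, dx)
    area_lorenz = trapezoid_area(lorenz, dx)
    return (area_pe - area_lorenz) / area_pe


# Quintile shares obtained by pairing up the decile percents from the question
quintile_shares = [0.0108, 0.0148, 0.0444, 0.15, 0.78]

lorenz_without_zero = np.cumsum(quintile_shares)
lorenz_with_zero = np.concatenate(([0.0], lorenz_without_zero))

gini_without_zero = gini_from_lorenz(lorenz_without_zero)
gini_with_zero = gini_from_lorenz(lorenz_with_zero)

print('Without leading 0:', round(gini_without_zero, 2))
print('With leading 0:   ', round(gini_with_zero, 2))
```

Under these assumptions the two printed values should round to 0.59 and 0.67 respectively, which is the gap discussed in this answer.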

The following gives a result equivalent to the methods given in the answers to this question:

import numpy as np
    
GDP = 653200000000 # this value isn't actually needed
    
# Decile percents of global GDP
gdp_decile_percents = [0.49, 0.59, 0.69, 0.79, 1.89, 2.55, 5.0, 10.0, 18.0, 60.0]
# Compare with a tolerance: exact == on a float sum can fail due to rounding
print('Percents sum to 100:', np.isclose(sum(gdp_decile_percents), 100))
    
gdp_decile_shares = [i/100 for i in gdp_decile_percents]
    
# Convert to quintile shares of total GDP
gdp_quintile_shares = [(gdp_decile_shares[i] + gdp_decile_shares[i+1]) for i in range(0, len(gdp_decile_shares), 2)]
    
# Insert 0 for the first value in the Lorenz curve
gdp_quintile_shares.insert(0, 0)
    
# Cumulative sum of shares (Lorenz curve values)
shares_cumsum = np.cumsum(a=gdp_quintile_shares, axis=None)
    
# Perfect equality line
pe_line = np.linspace(start=0.0, stop=1.0, num=len(shares_cumsum))

# n points span n - 1 intervals; dx cancels in the ratio below either way
dx = 1 / (len(shares_cumsum) - 1)
area_under_lorenz = np.trapz(y=shares_cumsum, dx=dx)
area_under_pe = np.trapz(y=pe_line, dx=dx)
    
gini = (area_under_pe - area_under_lorenz) / area_under_pe
    
print('Gini coefficient:', gini)

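The coefficient computed from the np.trapz areas is 0.67. The value computed with trapz but without the first point of the Lorenz curve is 0.59. Our calculation of global inequality is now roughly equal to what the methods in the question linked above give (you do not need to add 0 to the lists/arrays in those methods). Note that using scipy.integrate.simps (named scipy.integrate.simpson in newer SciPy releases) gives 0.69, which means the methods in the other question agree more closely with trapezoidal than with Simpson integration.

Here is the plot, which uses plt.fill_between to shade the area under the Lorenz curve: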

from matplotlib import pyplot as plt

plt.plot(pe_line, shares_cumsum, label='lorenz_curve')
plt.plot(pe_line, pe_line, label='perfect_equality')
plt.fill_between(pe_line, shares_cumsum)
plt.title('Gini: {}'.format(gini), fontsize=20)
plt.ylabel('Cumulative Share of Global GDP', fontsize=15)
plt.xlabel('Income Quintiles (Lowest to Highest)', fontsize=15)
plt.legend()
plt.tight_layout()
plt.show()

The resulting Gini curve.

score 1 · Accepted Answer

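Stardust,

Your problem is not with numpy.trapz; it is with 1) your definition of the perfect equality distribution, and 2) the normalization of the Gini coefficient.

First, you defined the perfect equality distribution as Q5*H5/2, which is half the product of the fifth quintile's income and the cumulative percentage (1.0). I'm not sure what this number is supposed to represent.

Second, you must normalize by the area under the perfect equality distribution; i.e.:

gini = (area under perfect equality - area under lorenz) / (area under perfect equality)

You don't have to worry about this if you define the perfect equality curve to have an area of 1, but it is a good safeguard in case there is an error in your definition of the perfect equality curve.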
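A small sketch of why the normalization matters: the raw difference between the two areas depends on the spacing dx passed to the trapezoidal rule, while the normalized ratio does not. The helper trapezoid_area is hypothetical and hand-written for this illustration, and the Lorenz values assume the curve starts at 0 (an assumption that the code in this answer does not make).

```python
import numpy as np


def trapezoid_area(y, dx):
    """Composite trapezoidal rule for equally spaced samples (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    return dx * (y[:-1] + y[1:]).sum() / 2.0


# Cumulative quintile shares from the question, with an assumed leading 0
lorenz = np.array([0.0, 0.0108, 0.0256, 0.07, 0.22, 1.0])
equality = np.linspace(0.0, 1.0, len(lorenz))

raw_differences = {}
normalized = {}
for dx in (0.2, 1.0, 5.0):
    area_pe = trapezoid_area(equality, dx)
    area_lorenz = trapezoid_area(lorenz, dx)
    raw_differences[dx] = area_pe - area_lorenz                # scales with dx
    normalized[dx] = (area_pe - area_lorenz) / area_pe         # dx cancels out
    print(dx, raw_differences[dx], normalized[dx])
```

The raw difference grows in proportion to dx, so on its own it is not a Gini coefficient; dividing by the area under perfect equality removes the dependence on the x-axis scale.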

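To solve both of these problems, I defined the perfect equality curve with numpy.linspace. The first advantage of this is that you can use the properties of the real distribution to define it in the same way. In other words, whether you use quartiles, quintiles, or deciles, the perfect equality CDF (y_pe, below) will have the correct shape. The second advantage is that its area is also calculated with numpy.trapz, a bit of parallelism that makes the code easier to read and guards against erroneous calculations.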
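As a rough illustration of the first advantage, the sketch below sizes the perfect equality line from the data, so the same function works for deciles and quintiles without changes. The function name gini_from_shares is hypothetical, and it prepends 0 to the Lorenz curve, which is an assumption of this sketch rather than something the code in this answer does.

```python
import numpy as np


def gini_from_shares(shares):
    """Normalized Gini ratio from per-group income shares (hypothetical helper).

    Assumes equally sized groups ordered from poorest to richest.
    """
    shares = np.asarray(shares, dtype=float)
    lorenz = np.concatenate(([0.0], np.cumsum(shares)))   # assumed to start at 0
    equality = np.linspace(0.0, 1.0, len(lorenz))          # same length as the data

    def unit_area(y):
        # Trapezoidal rule with dx = 1; the spacing cancels in the ratio below
        return (y[:-1] + y[1:]).sum() / 2.0

    return (unit_area(equality) - unit_area(lorenz)) / unit_area(equality)


# Decile percents from the question, converted to shares of the total
decile_shares = np.array([0.49, 0.59, 0.69, 0.79, 1.89, 2.55, 5.0, 10.0, 18.0, 60.0]) / 100
# Pair up adjacent deciles to get quintile shares
quintile_shares = decile_shares.reshape(-1, 2).sum(axis=1)

gini_deciles = gini_from_shares(decile_shares)
gini_quintiles = gini_from_shares(quintile_shares)
gini_equal = gini_from_shares([0.2] * 5)  # every quintile holds the same share

print('Deciles:  ', gini_deciles)
print('Quintiles:', gini_quintiles)
print('Equal:    ', gini_equal)
```

The decile and quintile results differ because the finer grouping follows the Lorenz curve more closely; a perfectly equal distribution should come out at approximately 0 in either case.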

import numpy as np
from matplotlib import pyplot as plt
from numpy import trapz

GDP = (653200000000)
A = (0.49 * GDP) / 100 # Poorest 10%
B = (0.59 * GDP) / 100
C = (0.69 * GDP) / 100
D = (0.79 * GDP) / 100
E = (1.89 * GDP) / 100
F = (2.55 * GDP) / 100
G = (5.0 * GDP) / 100
H = (10.0 * GDP) / 100
I = (18.0 * GDP) / 100
J = (60.0 * GDP) / 100 # Richest 10%

# Divide into quintiles and total income within each quintile
Q1 = float(A + B) # lowest quintile
Q2 = float(C + D) # second quintile
Q3 = float(E + F) # third quintile
Q4 = float(G + H) # fourth quintile
Q5 = float(I + J) # fifth quintile

# Calculate the percent of total income in each quintile
T1 = float((100 * Q1) / GDP) / 100
T2 = float((100 * Q2) / GDP) / 100
T3 = float((100 * Q3) / GDP) / 100
T4 = float((100 * Q4) / GDP) / 100
T5 = float((100 * Q5) / GDP) / 100

TR = float(T1 + T2 + T3 + T4 + T5)

# Calculate the cumulative percentage of household income
H1 = float(T1)
H2 = float(T1+T2)
H3 = float(T1+T2+T3)
H4 = float(T1+T2+T3+T4)
H5 = float(T1+T2+T3+T4+T5)

# The y values. Cumulative percentage of incomes
y = np.array([H1,H2,H3,H4,H5])

# The perfect equality y values. Cumulative percentage of incomes.
y_pe = np.linspace(0.0,1.0,len(y))

# Compute the area using the composite trapezoidal rule.
area_lorenz = np.trapz(y, dx=5)

# Calculate the area below the perfect equality line.
area_perfect = np.trapz(y_pe, dx=5)

# Seems to work fine until here. 
# Manually calculated Gini using the values given for the areas above 
# turns out at .58 which seems reasonable?

Gini = (area_perfect - area_lorenz)/area_perfect

print(Gini)

plt.plot(y,label='lorenz')
plt.plot(y_pe,label='perfect_equality')
plt.legend()
plt.show()
