python - 如何绘制经验 cdf (ecdf)

Question

如何在 Python 中绘制 matplotlib 中数字数组的经验 CDF？我正在寻找 pylab 的“hist”函数的 cdf 模拟。

我能想到的一件事是：

from scipy.stats import cumfreq
a = array([...]) # my array of numbers
num_bins =  20
b = cumfreq(a, num_bins)
plt.plot(b)

score 118 · Accepted Answer

如果您喜欢linspace并喜欢单线，您可以执行以下操作：

plt.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))

鉴于我的口味，我几乎总是这样做：

# a is the data array
x = np.sort(a)
y = np.arange(len(x))/float(len(x))
plt.plot(x, y)

即使有>O(1e6)数据值，它也对我有用。如果你真的需要下采样，我会设置

x = np.sort(a)[::down_sampling_step]

编辑以回应评论/编辑我为什么使用endpoint=False或y上面定义的。以下是一些技术细节。

经验 CDF 通常正式定义为

CDF(x) = "number of samples <= x"/"number of samples"

为了完全匹配这个正式的定义，你需要使用y = np.arange(1,len(x)+1)/float(len(x))这样我们得到 y = [1/N, 2/N ... 1]. 这个估计器是一个无偏估计器，它将在无限样本维基百科参考的限制下收敛到真正的 CDF 。.

我倾向于使用y = [0, 1/N, 2/N ... (N-1)/N]，因为（a）它更容易编码/更惯用，（b）但仍然是形式上合理的，因为人们总是可以在收敛证明中进行交换CDF(x)，1-CDF(x)并且（c）使用上面描述的（简单）下采样方法.

在某些特定情况下，定义

y = (arange(len(x))+0.5)/len(x)

介于这两个约定之间。实际上，它说“有1/(2N)可能小于我在样本中看到的最低值，也有1/(2N)可能大于我迄今为止看到的最大值。

请注意，此约定的选择与ifwhere中使用的参数相互作用，plt.step将 CDF 显示为分段常数函数似乎更有用。为了与上面提到的正式定义完全匹配，需要使用where=pre建议的y=[0,1/N..., 1-1/N]约定，或者where=post使用y=[1/N, 2/N ... 1]约定，而不是相反。

但是，对于大样本和合理分布，答案主体中给出的约定很容易编写，是真实 CDF 的无偏估计量，并且适用于下采样方法。

score 79 · Accepted Answer

您可以使用scikits.statsmodels库中的ECDF函数：

import numpy as np
import scikits.statsmodels as sm
import matplotlib.pyplot as plt

sample = np.random.uniform(0, 1, 50)
ecdf = sm.tools.ECDF(sample)

x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)

随着版本 0.4scicits.statsmodels被重命名为statsmodels. ECDF现在位于distributions模块中（同时statsmodels.tools.tools.ECDF已折旧）。

import numpy as np
import statsmodels.api as sm # recommended import according to the docs
import matplotlib.pyplot as plt

sample = np.random.uniform(0, 1, 50)
ecdf = sm.distributions.ECDF(sample)

x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
plt.show()

score 18 · Accepted Answer

这看起来（几乎）正是你想要的。两件事情：

首先，结果是四个项目的元组。第三是垃圾箱的大小。第二个是最小 bin 的起点。第一个是每个 bin 中或下方的点数。（最后一个是超出限制的点数，但由于您没有设置任何点，所有点都将被分箱。）

其次，您需要重新调整结果，使最终值为 1，以遵循 CDF 的通常约定，否则它是正确的。

这是它在引擎盖下的作用：

def cumfreq(a, numbins=10, defaultreallimits=None):
    # docstring omitted
    h,l,b,e = histogram(a,numbins,defaultreallimits)
    cumhist = np.cumsum(h*1, axis=0)
    return cumhist,l,b,e

它进行直方图，然后生成每个 bin 中计数的累积总和。所以结果的第 i 个值是小于或等于第 i 个 bin 的最大值的数组值的个数。因此，最终值只是初始数组的大小。

最后，要绘制它，您需要使用 bin 的初始值和 bin 大小来确定您需要的 x 轴值。

另一种选择是使用numpy.histogramwhich 可以进行标准化并返回 bin 边缘。您需要自己计算结果计数的累积总和。

a = array([...]) # your array of numbers
num_bins = 20
counts, bin_edges = numpy.histogram(a, bins=num_bins, normed=True)
cdf = numpy.cumsum(counts)
pylab.plot(bin_edges[1:], cdf)

（bin_edges[1:]是每个 bin 的上边缘。）

score 15 · Accepted Answer

15

您是否尝试过 pyplot.hist 的累积 = True 参数？

于 2011-04-28T05:59:43.650 回答

score 7 · Accepted Answer

基于戴夫的回答的单线：

plt.plot(np.sort(arr), np.linspace(0, 1, len(arr), endpoint=False))

编辑：这也是 hans_meine 在评论中提出的。

score 3 · Accepted Answer

你想用 CDF 做什么？绘制它，这是一个开始。您可以尝试一些不同的值，如下所示：

from __future__ import division
import numpy as np
from scipy.stats import cumfreq
import pylab as plt

hi = 100.
a = np.arange(hi) ** 2
for nbins in ( 2, 20, 100 ):
    cf = cumfreq(a, nbins)  # bin values, lowerlimit, binsize, extrapoints
    w = hi / nbins
    x = np.linspace( w/2, hi - w/2, nbins )  # care
    # print x, cf
    plt.plot( x, cf[0], label=str(nbins) )

plt.legend()
plt.show()

直方图列出了 bin 数量的各种规则，例如num_bins ~ sqrt( len(a) )。

（细则：这里发生了两件完全不同的事情，

对原始数据进行分箱/直方图
plot通过 20 个分箱值插入一条平滑曲线。

这些中的任何一个都可能在“块状”或长尾的数据上消失，即使对于 1d 数据也是如此——2d、3d 数据变得越来越困难。
另请参阅 Density_estimation 和使用 scipy 高斯核密度估计）。

score 3 · Accepted Answer

我对 AFoglia 的方法有一个微不足道的补充，以规范化 CDF

n_counts,bin_edges = np.histogram(myarray,bins=11,normed=True) 
cdf = np.cumsum(n_counts)  # cdf not normalized, despite above
scale = 1.0/cdf[-1]
ncdf = scale * cdf

归一化 histo 使其整体统一，这意味着 cdf 不会被归一化。你必须自己缩放它。

score 3 · Accepted Answer

如果您想显示实际的真实 ECDF（正如 David B 所指出的那样，这是一个在 n 个数据点中的每一个处增加 1/n 的阶跃函数），我的建议是编写代码为每个数据点生成两个“绘图”点：

a = array([...]) # your array of numbers
sorted=np.sort(a)
x2 = []
y2 = []
y = 0
for x in sorted: 
    x2.extend([x,x])
    y2.append(y)
    y += 1.0 / len(a)
    y2.append(y)
plt.plot(x2,y2)

通过这种方式，您将获得一个带有 n 个步骤的图，这些步骤是 ECDF 的特征，这对于小到足以让步骤可见的数据集尤其有用。此外，无需对直方图进行任何分箱（这可能会给绘制的 ECDF 带来偏差）。

score 3 · Accepted Answer

我们可以只使用step函数 from matplotlib，它会绘制一个逐步图，这就是经验 CDF 的定义：

import numpy as np
from matplotlib import pyplot as plt

data = np.random.randn(11)

levels = np.linspace(0, 1, len(data) + 1)  # endpoint 1 is included by default
plt.step(sorted(list(data) + [max(data)]), levels)

最后的垂直线 atmax(data)是手动添加的。否则情节只会停在 level 1 - 1/len(data)。

或者，我们可以使用该where='post'选项step()

levels = np.linspace(1. / len(data), 1, len(data))
plt.step(sorted(data), levels, where='post')

在这种情况下，不会绘制从零开始的初始垂直线。

score 3 · Accepted Answer

这是使用散景

from bokeh.plotting import figure, show
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(pd_series)
p = figure(title="tests", tools="save", background_fill_color="#E8DDCB")
p.line(ecdf.x,ecdf.y)
show(p)

score 2 · Accepted Answer

假设 vals 包含您的值，那么您可以简单地绘制 CDF，如下所示：

y = numpy.arange(0, 101)
x = numpy.percentile(vals, y)
plot(x, y)

要在 0 和 1 之间缩放，只需将 y 除以 100。

score 2 · Accepted Answer

它是 seaborn 中使用累积 = True 参数的单线。干得好，

import seaborn as sns
sns.kdeplot(a, cumulative=True)

score 1 · Accepted Answer

（这是我对这个问题的回答的副本：Plotting CDF of a pandas series in python）

CDF 或累积分布函数图基本上是一个图表，X 轴为排序值，Y 轴为累积分布。因此，我将创建一个新系列，其中排序值作为索引，累积分布作为值。

首先创建一个示例系列：

import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))

对系列进行排序：

ser = ser.order()

现在，在继续之前，再次附加最后一个（也是最大的）值。这一步很重要，特别是对于小样本量以获得无偏 CDF：

ser[len(ser)] = ser.iloc[-1]

创建一个新系列，其中排序值作为索引，累积分布作为值

cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)

最后，将函数绘制为步骤：

ser_cdf.plot(drawstyle='steps')

score 0 · Accepted Answer

到目前为止，没有一个答案能涵盖我到达这里时想要的东西，即：

def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    return np.mean(data[None, :] <= x[:, None], axis=1)

它在点 x 的数组处评估给定数据集的经验 CDF，这些点不必排序。没有中间分箱，也没有外部库。

对大 x 更好地扩展的等效方法是对数据进行排序并使用 np.searchsorted：

def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    data = np.sort(data)
    return np.searchsorted(data, x)/float(data.size)

score 0 · Accepted Answer

在我看来，以前的方法都没有完成绘制经验 CDF 的完整（和严格）工作，这是提问者的原始问题。我为任何迷失和同情的灵魂发布我的建议。

我的建议有以下几点：1）它考虑了在第一个表达式中定义的经验 CDF ，即，就像在 AW Van der Waart 的渐近统计（1998 年）中一样，2）它明确显示了函数的阶跃行为，3）它通过显示标记以解决不连续性，明确表明经验 CDF 从右侧连续，4) 它将极端的零和一值扩展到用户定义的边距。我希望它可以帮助某人：

def plot_cdf( data, xaxis = None, figsize = (20,10), line_style = 'b-',
ball_style = 'bo', xlabel = r"Random variable $X$", ylabel = "$N$-samples
empirical CDF $F_{X,N}(x)$" ):
     # Contribution of each data point to the empirical distribution
     weights = 1/data.size * np.ones_like( data )
     # CDF estimation
     cdf = np.cumsum( weights )
     # Plot central part of the CDF
     plt.figure( figsize = (20,10) )
     plt.step( np.sort( a ), cdf, line_style, where = 'post' )
     # Plot valid points at discontinuities
     plt.plot( np.sort( a ), cdf, ball_style )
     # Extract plot axis and extend outside the data range
     if not xaxis == None:
         (xmin, xmax, ymin, ymax) = plt.axis( )
         xmin = xaxis[0]
         xmax = xaxis[1]
         plt.axis( [xmin, xmax, ymin, ymax] )
     else:
         (xmin,xmax,_,_) = plt.axis()
         plt.plot( [xmin, a.min(), a.min()], np.zeros( 3 ), line_style )
     plt.plot( [a.max(), xmax], np.ones( 2 ), line_style )
     plt.xlabel( xlabel )
     plt.ylabel( ylabel )

score 0 · Accepted Answer

我为大型数据集评估 cdf 所做的工作 -

查找唯一值

unique_values = np.sort(pd.Series)
为数据集中这些已排序且唯一的值创建排名数组 -

等级 = np.arange(0,len(unique_values))/(len(unique_values)-1)
绘制 unique_values 与排名

示例下面的代码绘制了来自 kaggle 的人口数据集的 cdf -

us_census_data = pd.read_csv('acs2015_census_tract_data.csv')

population = us_census_data['TotalPop'].dropna()

## sort the unique values using pandas unique function
unique_pop = np.sort(population.unique())
cdf = np.arange(0,len(unique_pop),step=1)/(len(unique_pop)-1)

## plotting    
plt.plot(unique_pop,cdf)
plt.show()

score 0 · Accepted Answer

这可以使用seaborn.ecdfplot.
- seaborn是matplotlib.
- data可以是pandas.DataFrame, numpy.ndarray, mapping, 或sequence.
有关其他选项，请参阅如何将标记与 ECDF 图一起使用。

import seaborn as sns
import matplotlib.pyplot as plt

# lead sample dataframe
df = sns.load_dataset('penguins', cache=False)

# display(df.head(3))
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female

# plot ecdf
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

p1 = sns.ecdfplot(data=df, x='bill_length_mm', ax=ax1)
p1.set_title('Without hue')

p2 = sns.ecdfplot(data=df, x='bill_length_mm', hue='species', ax=ax2)
p2.set_title('Separated species by hue')

score 0 · Accepted Answer

虽然，这里有很多很好的答案，但我会包括一个更定制的 ECDF 图

为经验累积分布函数生成值

import matplotlib.pyplot as plt

def ecdf_values(x):
    """
    Generate values for empirical cumulative distribution function
    
    Params
    --------
        x (array or list of numeric values): distribution for ECDF
    
    Returns
    --------
        x (array): x values
        y (array): percentile values
    """
    
    # Sort values and find length
    x = np.sort(x)
    n = len(x)
    # Create percentiles
    y = np.arange(1, n + 1, 1) / n
    return x, y

def ecdf_plot(x, name = 'Value', plot_normal = True, log_scale=False, save=False, save_name='Default'):
    """
    ECDF plot of x

    Params
    --------
        x (array or list of numerics): distribution for ECDF
        name (str): name of the distribution, used for labeling
        plot_normal (bool): plot the normal distribution (from mean and std of data)
        log_scale (bool): transform the scale to logarithmic
        save (bool) : save/export plot
        save_name (str) : filename to save the plot
    
    Returns
    --------
        none, displays plot
    
    """
    xs, ys = ecdf_values(x)
    fig = plt.figure(figsize = (10, 6))
    ax = plt.subplot(1, 1, 1)
    plt.step(xs, ys, linewidth = 2.5, c= 'b');
    
    plot_range = ax.get_xlim()[1] - ax.get_xlim()[0]
    fig_sizex = fig.get_size_inches()[0]
    data_inch = plot_range / fig_sizex
    right = 0.6 * data_inch + max(xs)
    gap = right - max(xs)
    left = min(xs) - gap
    
    if log_scale:
        ax.set_xscale('log')
        
    if plot_normal:
        gxs, gys = ecdf_values(np.random.normal(loc = xs.mean(), 
                                                scale = xs.std(), 
                                                size = 100000))
        plt.plot(gxs, gys, 'g');

    plt.vlines(x=min(xs), 
               ymin=0, 
               ymax=min(ys), 
               color = 'b', 
               linewidth = 2.5)
    
    # Add ticks
    plt.xticks(size = 16)
    plt.yticks(size = 16)
    # Add Labels
    plt.xlabel(f'{name}', size = 18)
    plt.ylabel('Percentile', size = 18)

    plt.vlines(x=min(xs), 
               ymin = min(ys), 
               ymax=0.065, 
               color = 'r', 
               linestyle = '-', 
               alpha = 0.8, 
               linewidth = 1.7)
    
    plt.vlines(x=max(xs), 
               ymin=0.935, 
               ymax=max(ys), 
               color = 'r', 
               linestyle = '-', 
               alpha = 0.8, 
               linewidth = 1.7)

    # Add Annotations
    plt.annotate(s = f'{min(xs):.2f}', 
                 xy = (min(xs), 
                       0.065),
                horizontalalignment = 'center',
                verticalalignment = 'bottom',
                size = 15)
    plt.annotate(s = f'{max(xs):.2f}', 
                 xy = (max(xs), 
                       0.935),
                horizontalalignment = 'center',
                verticalalignment = 'top',
                size = 15)
    
    ps = [0.25, 0.5, 0.75]

    for p in ps:

        ax.set_xlim(left = left, right = right)
        ax.set_ylim(bottom = 0)

        value = xs[np.where(ys > p)[0][0] - 1]
        pvalue = ys[np.where(ys > p)[0][0] - 1]

        plt.hlines(y=p, xmin=left, xmax = value,
                    linestyles = ':', colors = 'r', linewidth = 1.4);

        plt.vlines(x=value, ymin=0, ymax = pvalue, 
                   linestyles = ':', colors = 'r', linewidth = 1.4)
        
        plt.text(x = p / 3, y = p - 0.01, 
                 transform = ax.transAxes,
                 s = f'{int(100*p)}%', size = 15,
                 color = 'r', alpha = 0.7)

        plt.text(x = value, y = 0.01, size = 15,
                 horizontalalignment = 'left',
                 s = f'{value:.2f}', color = 'r', alpha = 0.8);

    # fit the labels into the figure
    plt.title(f'ECDF of {name}', size = 20)
    plt.tight_layout()
    

    if save:
        plt.savefig(save_name + '.png')

ecdf_plot(np.random.randn(100), name='Normal Distribution', save=True, save_name="ecdf")

其他资源：

python - 如何绘制经验 cdf (ecdf)

18 回答 18

Related

Reference