python - Strange behaviour of missing values when extracted from a DataFrame/Series

Question

While working in pandas I ran into some very strange behaviour with missing values that took me completely by surprise.

Consider the following:

import pandas as pd
import numpy as np
from numpy import nan as NA
from pandas import DataFrame

In [1]: L1 = [NA, NA]
In [2]: L1
Out[2]: [nan, nan]
In [3]: set(L1)
Out[3]: {nan}

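So far so good: as expected, the set built from the list L1 contains a single NA value. But now I am completely confused by what happens when I do the same thing with a list extracted from a DataFrame column (a Series):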

In [4]: EG = DataFrame(np.random.rand(10), columns = ['Data'])
In [5]: EG.loc[5:6, 'Data'] = NA  # label-based and inclusive: sets rows 5 and 6 without chained assignment
In [6]: L2 = list(EG['Data'][5:7])
In [7]: L2
Out[7]: [nan, nan]
In [9]: set(L2)
Out[9]: {nan, nan}

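What is going on here? Why do the sets differ when the lists they are built from look exactly the same?

I did some digging and thought the types might differ (which would be surprising, since the NA values were created in what looks to me like exactly the same way). See the following: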

In [10]: type(L1[0])
Out[10]: float
In [11]: type(L1[1])
Out[11]: float
In [12]: type(L2[0])
Out[12]: numpy.float64
In [13]: type(L2[1])
Out[13]: numpy.float64

Clearly the types differ, which already surprised me. But if I convert every element of L2 to a float, as in L1, the strange set behaviour should disappear:

In [14]: L3 = [float(elem) for elem in L2]
In [15]: L3
Out[15]: [nan, nan]
In [16]: type(L3[0])
Out[16]: float
In [17]: type(L3[1])
Out[17]: float 
In [18]: set(L3)
Out[18]: {nan, nan}

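The problem persists even though the elements of L3 have exactly the same type as the elements of L1.

Can anyone help?

I rely on the usual behaviour of set(L) when aggregating data with groupby. I noticed this problem and it is driving me crazy. I would be interested in a workaround, but I would much rather understand what is actually going on here...

Please help...

EDIT: In response to a user comment, I am posting the code I am actually using to aggregate the data. I am not sure it changes the nature of the problem, but it may help explain why this is so frustrating: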

def NoActionRequired(x):
    """ This function is used to aggregate the data that is believed to be equal within multi line/day groups. It puts the data
    into a list and then if that list forms a set of length 1 (which it must if the data are in fact equal) then the single
    value contained in the set is returned, otherwise the list is returned. This allows for the fact that we may be wrong about
    the equality of the data, and it is something that can be tested after aggregation."""

    L = list(x)
    S = set(L)
    if len(S) == 1:
        return S.pop()
    else:
        return L

DFGrouped['Data'].agg(NoActionRequired)

The idea is that if all the data in a group are the same, a single value is returned; otherwise the list of data is returned.

score 1 · Accepted Answer

The only explanation I can see right now is that all the NA objects in the first list are the same object:

>>> L1 = [NA, NA]
>>> L1
[nan, nan]
>>> L1[0] is L1[1]
True

whereas the objects in the second list are different objects:

>>> L2 = list(pd.Series([NA, NA]))
>>> L2
[nan, nan]
>>> L2[0] is L2[1]
False
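Why object identity matters here can be shown without pandas at all. The sketch below assumes CPython and uses illustrative variable names that do not come from the question. NaN never compares equal to anything, including itself, but set and dict lookups check identity before they fall back to `==`, so a repeated object collapses into one element while separately created NaN objects do not. That is also why converting with float() does not help: each call produces a new object.

```python
# Two NaN values: one object used twice versus two separately created objects.
nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# NaN is not equal to itself under ==.
print(nan_a == nan_a)  # False

same_object = [nan_a, nan_a]
distinct_objects = [nan_a, nan_b]

# Sets test identity (`is`) first and only then equality (`==`),
# so the repeated object is deduplicated and the distinct objects are not.
print(len(set(same_object)))       # 1
print(len(set(distinct_objects)))  # 2
```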

As for your function, I would suggest using pandas.Series.unique() instead of set, for example:

def NoActionRequired(x):
    # ...    
    S = x.unique()
    if len(S) == 1:
        return S[0]
    else:
        return list(x)

It looks like unique() handles NaN correctly:

>>> pd.Series([NA, NA]).unique()
array([ nan])
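As a rough, self-contained sketch of the unique()-based approach, the function below can be called directly on a Series. The name `collapse_if_constant` and the sample data are made up for illustration, and the function is called on plain Series rather than through groupby, so how it behaves inside agg() is not shown here.

```python
import numpy as np
import pandas as pd


def collapse_if_constant(values: pd.Series):
    """Return the single distinct value of `values`, or the full list if there are several."""
    distinct = values.unique()  # repeated NaN entries collapse into one
    if len(distinct) == 1:
        return distinct[0]
    return list(values)


# A group that is entirely missing collapses to a single NaN.
all_missing = collapse_if_constant(pd.Series([np.nan, np.nan]))

# A group with differing values comes back as the full list.
mixed = collapse_if_constant(pd.Series([1.0, 2.0, 2.0]))
```

Note that the collapsed result for an all-missing group is still NaN, so it has to be checked with np.isnan() or pd.isna() rather than with ==.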

EDIT: To check whether NA is in a list, you can use the np.isnan() function:

>>> L = [NA, 1, 2]
>>> np.isnan(L)
array([ True, False, False], dtype=bool)
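To turn that element-wise mask into a single yes/no answer, it can be reduced with .any(). The sketch below also shows pd.isna() as an alternative for lists that may hold None or strings, which np.isnan() rejects with a TypeError. The sample lists are illustrative assumptions.

```python
import numpy as np
import pandas as pd

values = [np.nan, 1, 2]

# Element-wise mask reduced to one boolean: does the list contain any NaN?
has_nan = bool(np.isnan(values).any())

# pd.isna also treats None as missing and tolerates non-numeric entries.
mixed = [np.nan, None, "a"]
mask = pd.isna(mixed)
```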
