python - Filtering out holes in time series using pandas NaN

Question

I'm having some trouble filtering data using pandas NA values. I have a dataframe that looks like this:

        Jan       Feb       Mar       Apr       May  June
0  0.349143  0.249041  0.244352       NaN  0.425336   NaN
1  0.530616  0.816829       NaN  0.212282  0.099364   NaN
2  0.713001  0.073601  0.242077  0.553908       NaN   NaN
3  0.245295  0.007016  0.444352  0.515705  0.497119   NaN
4  0.195662  0.007249       NaN  0.852287       NaN   NaN

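I need to filter out the rows that have "holes". I treat each row as a time series, and by a hole I mean an NA in the middle of the series, but not at the end. In the dataframe above, rows 0, 1 and 4 have holes, but rows 2 and 3 do not (they only have NAs at the end of the row).

So far, the only way I can think of is something like this: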

for rowindex, row in df.iterrows():
    # now step through each entry in the row 
    # and after encountering the first NA, 
    # check if all subsequent values are NA too.

But I was hoping there might be a less convoluted and more efficient way to do this.

Thanks, Anne

score 3 · Accepted Answer

As you say, looping (iterrows) is a last resort. Try this, which uses apply with axis=1 instead of iterating through the rows.

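In [19]: def holey(s):
   ....:     # Position of the first non-null value in the row
   ....:     starts_at = s.notnull().values.argmax()
   ....:     rest = s.iloc[starts_at:]
   ....:     # Position, within rest, of the first null after the data starts
   ....:     next_null = rest.isnull().values.argmax()
   ....:     if next_null == 0:
   ....:         # No null after the start, or the row is entirely null
   ....:         return False
   ....:     # A hole exists if any value follows that first null
   ....:     return rest.iloc[next_null:].notnull().any()
   ....: 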

In [20]: df.apply(holey, axis=1)
Out[20]: 
0     True
1     True
2    False
3    False
4     True
dtype: bool

Now you can filter with df[~df.apply(holey, axis=1)].

A handy idiom here: use argmax() to find the first occurrence of True in a series of booleans.
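
To make the idiom concrete, here is a minimal sketch on made-up boolean Series (the names flags and no_true and their values are illustrative, not from the question). It also shows a caveat: argmax() still returns 0 when there is no True at all, which is why holey checks for 0 explicitly.

```python
import numpy as np
import pandas as pd

# Illustrative data: the first True sits at position 2
flags = pd.Series([False, False, True, False, True])

# argmax returns the position of the first maximum; for booleans
# that is the position of the first True
first_true = int(np.argmax(flags.values))

# Caveat: with no True at all, argmax still returns 0,
# so pair it with .any() when that case matters
no_true = pd.Series([False, False, False])
first_or_zero = int(np.argmax(no_true.values))
has_true = bool(no_true.any())

print(first_true, first_or_zero, has_true)  # 2 0 False
```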

score 3 · Accepted Answer

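Here is another way, using NumPy. It is faster because it uses NumPy functions on the whole underlying array, rather than applying a Python function to each row individually: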

import io
import pandas as pd
import numpy as np

content = '''\
        Jan       Feb       Mar       Apr       May June
   0.349143  0.249041  0.244352       NaN  0.425336  NaN
   0.530616  0.816829       NaN  0.212282  0.099364  NaN
   0.713001  0.073601  0.242077  0.553908  NaN       NaN
   0.245295  0.007016  0.444352  0.515705  0.497119  NaN
   0.195662  0.007249       NaN  0.852287  NaN       NaN'''

df = pd.read_table(io.StringIO(content), sep=r'\s+')

def remove_rows_with_holes(df):
    nans = np.isnan(df.values)
    # print(nans)
    # [[False False False  True False  True]
    #  [False False  True False False  True]
    #  [False False False False  True  True]
    #  [False False False False False  True]
    #  [False False  True False  True  True]]

    # First index (per row) which is a NaN
    nan_index = np.argmax(nans, axis=1)
    # print(nan_index)
    # [3 2 4 5 2]

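    # One past the last index (per row) which is not a NaN
    h, w = nans.shape
    not_nan_index = w - np.argmin(np.fliplr(nans), axis=1)
    # print(not_nan_index)
    # [5 5 4 5 4]

    # Keep rows whose first NaN comes after all values; rows with no NaN
    # at all give nan_index == 0, so keep them explicitly
    mask = (nan_index >= not_nan_index) | ~nans.any(axis=1)
    # print(mask)
    # [False False  True  True False]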

    # print(df[mask])
    #         Jan       Feb       Mar       Apr       May  June
    # 2  0.713001  0.073601  0.242077  0.553908       NaN   NaN
    # 3  0.245295  0.007016  0.444352  0.515705  0.497119   NaN
    return df[mask]

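def holey(s):
    # Position of the first non-null value in the row
    starts_at = s.notnull().values.argmax()
    rest = s.iloc[starts_at:]
    # Position, within rest, of the first null after the data starts
    next_null = rest.isnull().values.argmax()
    if next_null == 0:
        # No null after the start, or the row is entirely null
        return False
    # A hole exists if any value follows that first null
    return rest.iloc[next_null:].notnull().any()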

def remove_using_holey(df):
    mask = df.apply(holey, axis=1)
    return df[~mask]

Here are the timeit results:

In [78]: %timeit remove_using_holey(df)
1000 loops, best of 3: 1.53 ms per loop

In [79]: %timeit remove_rows_with_holes(df)
10000 loops, best of 3: 85.6 us per loop

The difference becomes more significant as the number of rows in the DataFrame grows:

In [85]: df = pd.concat([df]*100)

In [86]: %timeit remove_using_holey(df)
1 loops, best of 3: 1.29 s per loop

In [87]: %timeit remove_rows_with_holes(df)
1000 loops, best of 3: 440 us per loop

In [88]: 1.29 * 10**6 / 440
Out[88]: 2931.818181818182
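
As a cross-check on the definition of a hole, here is a different vectorized sketch, not the method used above: a row has a hole when some non-NaN value appears after a NaN. The function name has_hole and the sample array are made up for illustration, and the sketch assumes a float array. Note that under this reading a leading NaN followed by data also counts as a hole, and rows with no NaN at all are reported as hole-free.

```python
import numpy as np

def has_hole(values):
    """Return a boolean array: True for rows where a non-NaN follows a NaN."""
    nans = np.isnan(values)
    # seen_nan[i, j] is True once a NaN has occurred at or before column j
    seen_nan = np.logical_or.accumulate(nans, axis=1)
    # A hole: a real value in a position where a NaN was already seen
    return (seen_nan & ~nans).any(axis=1)

nan = np.nan
sample = np.array([
    [0.3, 0.2, 0.2, nan, 0.4, nan],   # hole
    [0.5, 0.8, nan, 0.2, 0.1, nan],   # hole
    [0.7, 0.1, 0.2, 0.6, nan, nan],   # trailing NaN only
    [0.2, 0.0, 0.4, 0.5, 0.5, nan],   # trailing NaN only
    [0.2, 0.0, nan, 0.9, nan, nan],   # hole
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],   # no NaN at all
])
holes = has_hole(sample)
print(holes)
```

With a DataFrame of floats, this could be applied as df[~has_hole(df.values)].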

score 1 · Accepted Answer

I ran into a problem similar to the OP's. I'm not sure why unutbu's solution didn't work for me, but this did:

def remove_rows_with_holes(df):
    # Requires numpy imported as np; keeps only rows with no NaN at all
    nans = np.isnan(df.values)
    mask = ~nans.any(axis=1)
    return df[mask]

To ignore a column, drop it before building the mask.
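
A minimal sketch of that last point, assuming a hypothetical column named label that should not take part in the NaN check (the column name and the data are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Jan": [0.3, 0.5, 0.7],
    "Feb": [0.2, np.nan, 0.1],
    "label": [np.nan, np.nan, np.nan],  # hypothetical column to ignore
})

# Build the mask from every column except the ignored one,
# then index the full frame so the ignored column stays in the result
numeric = df.drop(columns=["label"])
mask = ~np.isnan(numeric.values).any(axis=1)
complete = df[mask]

print(complete.index.tolist())  # [0, 2]
```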

Thanks for your help!
