python - How to efficiently remove non-finite values from a Vaex DataFrame with many columns?

Question

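My data contains values equal to positive and negative infinity. Vaex has the functions dropna, dropmissing and dropnan, but none of them removes non-finite values.

My current approach is to loop over each column of interest and overwrite the dataset with a filtered copy, removing the non-finite values one column at a time: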

...
for col in cols:
   df = df[df[col].isfinite()]
...

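While this approach does give me the correct result, it seems very inefficient: it takes a long time to run even though my dataset has only a few rows and a few thousand columns.

What is the preferred way to remove rows with non-finite values in Vaex?

Update:

Here is a working example that demonstrates the slowness I run into even on a trivial dataset: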

import vaex
import numpy as np
import pandas as pd

#create a dummy data frame with 1000 columns and a few rows, some with nan/inf
arr= []
for i in range(1000):
    arr.append([1] * 1 + [2] * 1 + [3] * 1 + [0] * 1 + [np.inf] * 1 + [-np.inf] * 1 + [np.nan] * 1)
df = pd.DataFrame(arr)
df = df.transpose()
df.columns = df.columns.map(str)
df = df.add_prefix('a')

df = vaex.from_pandas(df)

#eliminate rows that are not finite
for col in df.columns.keys(): #<-- this loop takes several minutes to run, I would expect it to be nearly instantaneous
    df = df[df[col].isfinite()]
df

Update 2: With slightly different cell values, an alternative way of selecting the finite records runs quickly but returns incorrect results:

import vaex
import numpy as np
import pandas as pd

arr= []
for i in range(2):
    if i == 1:
        arr.append([np.inf] * 1 + [2] * 1 + [3] * 1 + [0] * 1 + [1] * 1 + [1] * 1 + [1] * 1)
    else:
        arr.append([1] * 1 + [2] * 1 + [3] * 1 + [0] * 1 + [np.inf] * 1 + [-np.inf] * 1 + [np.nan] * 1)
df = pd.DataFrame(arr)
df = df.transpose()
df.columns = df.columns.map(str)
df = df.add_prefix('a')

df = vaex.from_pandas(df)
df

#   a0  a1
0   1   inf
1   2   2
2   3   3
3   0   0
4   inf 1
5   -inf    1
6   nan 1

is_col_finite = np.array([df[col].isfinite() for col in df.columns.keys()])
all_finite = np.all(is_col_finite, axis=0)
df = df[all_finite]
df

#   a0  a1
0   2   2
1   3   3
2   0   0
3   inf 1
4   -inf    1
5   nan 1

score 0 · Accepted Answer

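If I understand your question correctly, you want to remove every row that contains at least one non-finite value.

Instead of filtering df on each iteration of the for loop, you can build a to_keep variable that serves as a boolean mask:

True == keep the row
False == drop the row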

to_keep = None
for column in df.columns:
   if to_keep is None:
      to_keep = df[column].isfinite()
   else:
      to_keep = to_keep & df[column].isfinite()

df = df[to_keep]
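
The per-column masks have to be combined element-wise. Here is a minimal sketch of that accumulation, using plain NumPy arrays as a stand-in for Vaex columns. The `columns` dict and its values are made up for illustration, and none of this is Vaex's API:

```python
from functools import reduce
import operator

import numpy as np

# Hypothetical toy data: two columns stored as plain NumPy arrays
columns = {
    "a0": np.array([1.0, 2.0, 3.0, 0.0, np.inf, -np.inf, np.nan]),
    "a1": np.array([np.inf, 2.0, 3.0, 0.0, 1.0, 1.0, 1.0]),
}

# One boolean mask per column, then AND them together element-wise
masks = [np.isfinite(values) for values in columns.values()]
to_keep = reduce(operator.and_, masks)

# Apply the single combined mask once to every column
filtered = {name: values[to_keep] for name, values in columns.items()}
print(filtered)
```

Note that Python's `and` is not element-wise: on NumPy arrays with more than one element it raises a ValueError, which is why the sketch uses `&` (via `operator.and_`).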

I tried this with the snippet you provided:

import vaex
import numpy as np
import pandas as pd

arr = []
for i in range(1000):
    arr.append([1] * 1 + [2] * 1 + [3] * 1 + [0] * 1 + [np.inf] * 1 + [-np.inf] * 1 + [np.nan] * 1)
df = pd.DataFrame(arr)
df = df.transpose()
df.columns = df.columns.map(str)
df = df.add_prefix('a')

df = vaex.from_pandas(df)

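Filtering on each iteration of the for loop: > 10 minutes (I killed it before it finished because it was taking too long)
Using the to_keep mask: 3 s

This mask approach is very similar to your Update 2. Its advantage is that it keeps a single mask vector instead of building one vector per column and then reducing them with np.all.

If you want to use your own solution, you have to do the following: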

is_col_finite = np.array([df[col].isfinite().values for col in df.columns.keys()]) # add .values to convert the vaex expression to a numpy array
to_keep = np.all(is_col_finite, axis=0)

df["to_keep"] = to_keep
df[df["to_keep"]]

This prints the expected output:

a0  a1  to_keep
2   2   True
3   3   True
0   0   True
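
To see why `axis=0` is the right axis for the np.all call, here is a small NumPy-only sketch using the values from the Update 2 example. The arrays `a0` and `a1` are plain NumPy stand-ins for the Vaex columns, and `kept_rows` is a name introduced only for this illustration:

```python
import numpy as np

# Values copied from the Update 2 example, one array per column
a0 = np.array([1.0, 2.0, 3.0, 0.0, np.inf, -np.inf, np.nan])
a1 = np.array([np.inf, 2.0, 3.0, 0.0, 1.0, 1.0, 1.0])

# Stacking the per-column masks gives shape (n_columns, n_rows)
is_col_finite = np.array([np.isfinite(a0), np.isfinite(a1)])

# axis=0 reduces across columns, leaving one boolean per row
to_keep = np.all(is_col_finite, axis=0)

# Positions of the rows that are finite in every column
kept_rows = np.flatnonzero(to_keep)
print(kept_rows, a0[to_keep], a1[to_keep])
```

Only the rows that are finite in both columns survive, which matches the three rows in the output above.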

Note: I don't know why, but I could not use df[to_keep] directly, so I had to add a column containing the mask.

Here are the versions of my vaex packages:

vaex==3.0.0
vaex-arrow==0.5.1
vaex-astro==0.7.0
vaex-core==2.0.3
vaex-hdf5==0.6.0
vaex-jupyter==0.5.2
vaex-ml==0.9.0
vaex-server==0.3.1
vaex-viz==0.4.0
