When using vaex I came across an unexpected error: NameError: name 'column_2_0' is not defined.

After some investigation I found that the column causing problems is actually called column_2.0 in my data source (an HDF5 file). vaex renames it to column_2_0, but when I perform operations using column names I run into the error. Here is a simple example that reproduces it:

import pandas as pd
import vaex
cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0,len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)

for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]

dfv.count()
...
NameError: name 'abc_1_1' is not defined

In this case it appears that vaex tries to rename abc.1 to abc_1, which is already taken, so it ends up using abc_1_1 instead.

I know that I can rename the column like dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like in this example where the column name that vaex comes up with is already taken and (b) I'd rather not have to do this manually each time I have a column that contains a period.

I could also enforce that column names in my source data never contain a period, but this seems like a stretch given that pandas and other sources the data might come from generally don't have this restriction.

What are some ideas to deal with this problem other than the two I mentioned above?

1 Answer

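In Vaex, the columns are in fact "expressions". Expressions let you build a sort of computational graph behind the scenes while you perform regular dataframe operations. However, this requires column names to be as "clean" as possible.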

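So column names like '2' or '2.5' are not allowed, since the expression system can interpret them as numbers rather than column names. Likewise, a column name like 'first-name' can be interpreted by the expression system as df['first'] - df['name'].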
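The ambiguity can be seen with plain Python, without vaex. The sketch below uses only the standard library and assumes the expression system follows Python's expression syntax; it is an illustration of the problem, not of vaex's actual parser.

```python
import ast

names = ["abc_1", "abc.1", "2.5", "first-name"]

# str.isidentifier() reports whether a name can appear bare in an expression.
safe = {name: name.isidentifier() for name in names}

# Parsed as Python expressions, the problematic names mean something else.
number = ast.parse("2.5", mode="eval").body              # a numeric literal
subtraction = ast.parse("first-name", mode="eval").body  # "first" minus "name"

print(safe)
print(type(number).__name__, type(subtraction).__name__)
```

Only abc_1 passes the identifier check; the other three names would need renaming or quoting before they could be used inside an expression string.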

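To avoid this, vaex smartly renames columns so that they can be used in the expression system. This is actually extremely complicated. So in your example above, you have found a case that is not covered yet (isna/notna).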

By the way, you can always get the original column names via df.get_column_names(alias=True).
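If the automatic renaming gets in the way, one possible workaround is to normalise the names yourself before the data reaches vaex, so that nothing is left for vaex to rename. This is a minimal sketch: make_safe_names is a hypothetical helper, not part of vaex or pandas, and it assumes the input names are unique and that none of them is a Python keyword.

```python
import re

def make_safe_names(names):
    """Map each original column name to a unique identifier-safe name.

    Hypothetical helper for illustration; not part of vaex or pandas.
    """
    # Names that are already valid stay unchanged, so reserve them first.
    taken = {name for name in names if name.isidentifier()}
    mapping = {}
    for name in names:
        if name.isidentifier():
            mapping[name] = name
            continue
        # Replace every non-word character (".", "-", " ", ...) with "_".
        base = re.sub(r"\W", "_", name)
        # An identifier cannot be empty or start with a digit.
        if not base or base[0].isdigit():
            base = "_" + base
        # Append a numeric suffix until the candidate is unused.
        candidate, suffix = base, 1
        while candidate in taken:
            candidate = f"{base}_{suffix}"
            suffix += 1
        taken.add(candidate)
        mapping[name] = candidate
    return mapping

mapping = make_safe_names(["abc_1", "abc1", "abc.1"])
print(mapping)
```

The mapping could then be applied with pandas' df.rename(columns=mapping) before calling vaex.from_pandas(df), and kept around to translate results back to the original names.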

answered 2020-06-10T08:41:19.630