When using vaex I came across an unexpected error: NameError: name 'column_2_0' is not defined.
After some investigation I found that in my data source (HDF5 file) the column name causing problems is actually called column_2.0 and that vaex renames it to column_2_0, but when performing operations using column names I run into the error. Here is a simple example that reproduces this error:
import pandas as pd
import vaex
cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0,len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)
for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]
dfv.count()
...
NameError: name 'abc_1_1' is not defined
In this case it appears that vaex tries to rename abc.1 to abc_1, which is already taken, so instead it ends up using abc_1_1.
I know that I can rename the column with dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like the one in this example, where the column name that vaex comes up with is already taken, and (b) I'd rather not have to do this manually each time I have a column that contains a period.
I could also require that column names in my source data never contain a period, but this seems like a stretch given that pandas and other sources the data might come from generally don't have this restriction.
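For reference, the kind of conflict-aware renaming logic I would otherwise have to write myself looks roughly like the sketch below. The helper name safe_column_names is my own invention (it is not part of vaex or pandas), and the rule of replacing every non-word character with an underscore is an assumption about what counts as a "safe" name. I have not verified that this avoids the error in every case.

```python
import re

def safe_column_names(names):
    """Map each problematic column name to a unique, identifier-like name.

    Hypothetical helper: names that are already safe are left out of the
    returned mapping, so it can be passed straight to a rename call.
    """
    taken = set(names)  # never collide with a name that already exists
    mapping = {}
    for name in names:
        # Replace anything that is not a letter, digit or underscore.
        candidate = re.sub(r"\W", "_", name)
        # Identifiers cannot start with a digit.
        if candidate and candidate[0].isdigit():
            candidate = "_" + candidate
        if candidate == name:
            continue  # already safe, nothing to rename
        # Append a numeric suffix until the name is free.
        base, n = candidate, 1
        while candidate in taken:
            candidate = f"{base}_{n}"
            n += 1
        taken.add(candidate)
        mapping[name] = candidate
    return mapping

mapping = safe_column_names(['abc_1', 'abc1', 'abc.1'])
print(mapping)  # {'abc.1': 'abc_1_1'}
```

The resulting mapping could then be applied on the pandas side, e.g. df = df.rename(columns=mapping), before calling vaex.from_pandas(df), so that the names vaex sees no longer contain a period. This is exactly the manual bookkeeping I would like to avoid.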
What are some ideas to deal with this problem other than the two I mentioned above?