I have a DataFrame in which one column holds comma-separated values encoded in quotes, i.e. as string objects. For example:
df['a']
'1,2,3,4,5'
'2,3,4,5,6'
I was able to convert these string-formatted lists of values into NumPy arrays and run my operation on them successfully.
def func(x):
    return something  # placeholder for the forecasting model

for t_df in pd.read_csv("testset.csv", chunksize=2000):
    t_df['predicted'] = t_df['prev'].parallel_apply(lambda x: arima(ast.literal_eval(x), 1))
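The string-to-array conversion mentioned above can be sketched on its own. This is a minimal illustration, not the actual pipeline: the helper name `parse_values` is hypothetical, and it assumes every cell contains at least two comma-separated integers (a single value such as '7' would be parsed as a plain int rather than a tuple).

```python
import ast

import numpy as np


def parse_values(cell):
    """Turn a string such as '1,2,3,4,5' into a 1-D NumPy array.

    Hypothetical helper for illustration only.
    """
    # ast.literal_eval reads the comma-separated text as a Python tuple
    values = ast.literal_eval(cell)
    return np.array(values)


arr = parse_values('1,2,3,4,5')
print(arr.tolist(), arr.ndim)
```

Each cell is parsed independently, which is why the step fits naturally into an `apply` call over the column.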
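Up to this point I had no problems. However, func runs a forecasting model and is very time-consuming, and the DataFrame has 2 million records.
So I tried the cudf package in Python to get GPU acceleration with a pandas-like DataFrame. This is where the problem appears: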
for t_df in pd.read_csv("testset.csv", chunksize=2):
    t_df['prev'] = t_df['prev'].apply(lambda x: np.array(ast.literal_eval(x)))
    t_df = cudf.DataFrame.from_pandas(t_df)
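When I apply the same operation, it fails with an error that basically says it is unable to convert the string-like objects to a NumPy array. The error is as follows: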
> ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-e7866d751352> in <module>
12 t_df['prev'] = t_df['prev'].apply(lambda x : np.array(ast.literal_eval(x)))
13 st = time.time()
---> 14 t_df = cudf.DataFrame.from_pandas(t_df)
15 t_df['predicted'] = 10
16 res.append(t_df)
/opt/conda/lib/python3.7/site-packages/cudf/core/dataframe.py in from_pandas(cls, dataframe, nan_as_null)
3109 # columns for a single key
3110 if len(vals.shape) == 1:
-> 3111 df[i] = Series(vals, nan_as_null=nan_as_null)
3112 else:
3113 vals = vals.T
/opt/conda/lib/python3.7/site-packages/cudf/core/series.py in __init__(self, data, index, name, nan_as_null, dtype)
128
129 if not isinstance(data, column.ColumnBase):
--> 130 data = column.as_column(data, nan_as_null=nan_as_null, dtype=dtype)
131
132 if index is not None and not isinstance(index, Index):
/opt/conda/lib/python3.7/site-packages/cudf/core/column/column.py in as_column(arbitrary, nan_as_null, dtype, length)
1353 elif arb_dtype.kind in ("O", "U"):
1354 data = as_column(
-> 1355 pa.Array.from_pandas(arbitrary), dtype=arbitrary.dtype
1356 )
1357 else:
/opt/conda/lib/python3.7/site-packages/cudf/core/column/column.py in as_column(arbitrary, nan_as_null, dtype, length)
1265 mask=pamask,
1266 size=pa_size,
-> 1267 offset=pa_offset,
1268 )
1269
/opt/conda/lib/python3.7/site-packages/cudf/core/column/numerical.py in __init__(self, data, dtype, mask, size, offset)
30 dtype = np.dtype(dtype)
31 if data.size % dtype.itemsize:
---> 32 raise ValueError("Buffer size must be divisible by element size")
33 if size is None:
34 size = data.size // dtype.itemsize
ValueError: Buffer size must be divisible by element size
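To make the input to `from_pandas` concrete, here is a pandas-only sketch of what the 'prev' column looks like after the conversion, next to a flat numeric layout of the same data. The two sample rows are the ones shown at the top; the variable names `as_arrays` and `wide` are illustrative, and the wide layout assumes every row has the same number of values. Whether the flat layout suits cudf or the forecasting function is an open assumption, not something verified here.

```python
import ast

import numpy as np
import pandas as pd

# Sample rows taken from the example column at the top of the question
df = pd.DataFrame({"prev": ["1,2,3,4,5", "2,3,4,5,6"]})

# Same conversion as in the failing loop: each cell becomes a NumPy array,
# so the column as a whole ends up with dtype object
as_arrays = df["prev"].apply(lambda x: np.array(ast.literal_eval(x)))
print(as_arrays.dtype)

# Flat alternative: one integer column per position in the list
# (assumes all rows contain the same number of values)
wide = df["prev"].str.split(",", expand=True).astype("int64")
print(wide.shape)
```

The first result is an object column whose cells are arrays, which is the kind of column being passed to `cudf.DataFrame.from_pandas` in the traceback above; the second contains only plain integer columns.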
What could be a possible solution?