I have a pandas DataFrame with the following columns:
user_id user_agent_id requests
All columns contain integers. I want to perform some operations on them and run those operations using a dask DataFrame. This is what I did:
import dask.dataframe as df  # assumed import; the alias `df` is taken from the call below

# Count how many times each (user_id, user_agent_id) pair appears
user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
    .groupby(['user_id', 'user_agent_id']) \
    .size().to_frame(name='appearances') \
    .reset_index()  # I am not sure I can run this on a dask dataframe

user_profile_ddf = df.from_pandas(user_profile, npartitions=4)

# Percentage of appearances within each user group
user_profile_ddf['percent'] = user_profile_ddf.groupby('user_id')['appearances'] \
    .apply(lambda x: x / x.sum(), meta=float)
But I get the following error:
raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
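Am I doing something wrong? In pure pandas this works fine, but it becomes slow with many rows (even though they fit in memory), so I would like to parallelize the computation.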
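For reference, here is a minimal pandas-only sketch of the calculation I am after, written with `transform('sum')` instead of `apply` with a lambda. The sample data is made up for illustration and only stands in for my real `cache_records_dataframe`. I have not confirmed that this removes the slowdown on my real data, and it does not address the dask error itself.

```python
import pandas as pd

# Hypothetical sample data standing in for the real cache_records_dataframe
cache_records_dataframe = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2],
    "user_agent_id": [10, 10, 11, 10, 12],
    "requests":      [5, 3, 7, 2, 4],
})

# Count how many times each (user_id, user_agent_id) pair appears
user_profile = (
    cache_records_dataframe[["user_id", "user_agent_id", "requests"]]
    .groupby(["user_id", "user_agent_id"])
    .size()
    .to_frame(name="appearances")
    .reset_index()
)

# Broadcast each user's total back onto that user's rows, then divide.
# transform("sum") returns a Series aligned with user_profile's index.
totals = user_profile.groupby("user_id")["appearances"].transform("sum")
user_profile["percent"] = user_profile["appearances"] / totals

print(user_profile)
```

With this sample, each user's `percent` values should add up to 1, which is the property I want to keep in whatever parallel version I end up with.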