pandas - One-hot encoding for a large dataset

Question

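I want to build a recommender system from association rules, using the apriori algorithm implemented in the mlxtend library. My sales data contains about 36 million transactions and 50 thousand unique products. I tried sklearn's OneHotEncoder and pandas get_dummies(), but both fail with an out-of-memory error because they cannot create a frame of shape (36 mil, 50k):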

MemoryError: Unable to allocate 398. GiB for an array with shape (36113798, 50087) and data type uint8

Is there any other solution?

score 1 · Accepted Answer

Like you, I initially ran into an out-of-memory error with mlxtend, but the following small change fixed it completely.
```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# itemSetList is assumed to be a list of transactions, each one a list of items
te = TransactionEncoder()

# Dense version, which ran out of memory:
# te_ary = te.fit(itemSetList).transform(itemSetList)
# df = pd.DataFrame(te_ary, columns=te.columns_)

# Sparse version: transform() returns a SciPy sparse matrix instead of a dense array
fitted = te.fit(itemSetList)
te_ary = fitted.transform(itemSetList, sparse=True)  # worked well for me
df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)  # worked well for me

# now you can call mlxtend's fpgrowth() followed by association_rules()
```
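To see why the sparse path helps, here is a rough sketch that compares the memory a dense one-hot array would need with what a SciPy CSR matrix actually stores. The sizes and the three-items-per-transaction pattern are made-up toy values, not figures from the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy sizes; the real data has far more rows and columns
n_rows, n_cols, items_per_row = 10_000, 5_000, 3

# Give every transaction three distinct item columns (deterministic pattern)
row_idx = np.repeat(np.arange(n_rows), items_per_row)
offsets = np.tile(np.arange(items_per_row), n_rows)
col_idx = (row_idx * 7 + offsets) % n_cols
data = np.ones(row_idx.shape[0], dtype=bool)

onehot = sparse.csr_matrix((data, (row_idx, col_idx)), shape=(n_rows, n_cols))

# A dense bool/uint8 array costs one byte per cell, whether or not it is set
dense_bytes = n_rows * n_cols
# CSR stores only the non-zero cells plus two index arrays
sparse_bytes = onehot.data.nbytes + onehot.indices.nbytes + onehot.indptr.nbytes

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```

The dense cost grows with rows times columns, while the sparse cost grows with the number of purchased items, which is why a basket matrix with tens of thousands of mostly empty columns is a good fit for the sparse format.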

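You should also use fpgrowth instead of apriori on large transaction datasets. apriori generates and tests candidate itemsets level by level, which becomes expensive as the data grows; fpgrowth builds an FP-tree and skips candidate generation, yet returns the same frequent itemsets. The mlxtend library supports both apriori and fpgrowth.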

score 0 · Accepted Answer

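I think a good solution to your problem is to use embeddings instead of one-hot encoding. I also suggest splitting the dataset into smaller subsets to further reduce memory consumption.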

You should also take a look at this thread: https://datascience.stackexchange.com/questions/29851/one-hot-encoding-vs-word-embeding-when-to-choose-one-or-another
