python - Efficient way to loop over a Pandas DataFrame to make dummy variables (1 or 0 input)

Question

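I am learning data science and want to make dummy variables for my dataset.

I have a DataFrame with a "Product Category" column, where each value is a list of matching categories, something like ["Category1", "Category2", ..., "CategoryN"].

I know Pandas has a nice function that generates dummy variables automatically (pandas.get_dummies), but I guess (?) I can't use it in this case.

I know how to loop over each row and set a 1 in every column that matches an element. My current code looks like this: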

for column_name in df.columns[1:]:  # first column is "Product Category"; the dummy columns (one per category name) were appended to its right earlier
    for index, _ in enumerate(df[column_name][:10]):  # limit to the first 10 rows
        if column_name in df["Product Category"][index]:
            df.loc[index, column_name] = 1  # .loc avoids chained assignment, which may silently write to a copy

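However, the code above is inefficient, and I can't use it because I have more than 100,000 rows. I would like to operate on the whole array at once somehow, but I don't know how to do it.

Can anybody help?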

score 2 · Accepted Answer

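With get_dummies() you can specify which columns to convert into dummy variables. Consider the following example, where several items can share the same category but each item belongs to only one dummy variable: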

import pandas as pd

df = pd.DataFrame({'Languages':  ['R', 'Python', 'C#', 'PHP', 'Java', 'XSLT', 'SQL'],
                   'ProductCategory':  ['Statistical', 'General Purpose',
                                        'General Purpose', 'Web', 'General Purpose',
                                        'Special Purpose', 'Special Purpose']})
# BEFORE
print(df)

#    Languages  ProductCategory
# 0          R      Statistical
# 1     Python  General Purpose
# 2         C#  General Purpose
# 3        PHP              Web
# 4       Java  General Purpose
# 5       XSLT  Special Purpose
# 6        SQL  Special Purpose

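# dtype=int keeps the 0/1 output shown below (pandas >= 2.0 returns booleans by default)
newdf = pd.get_dummies(df, columns=['ProductCategory'], prefix=['Categ'], dtype=int)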
# AFTER
print(newdf)

#    Languages  Categ_General Purpose  Categ_Special Purpose  Categ_Statistical  Categ_Web
# 0         R                      0                      0                  1          0
# 1    Python                      1                      0                  0          0
# 2        C#                      1                      0                  0          0
# 3       PHP                      0                      0                  0          1
# 4      Java                      1                      0                  0          0
# 5      XSLT                      0                      1                  0          0
# 6       SQL                      0                      1                  0          0
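
As a quick sanity check of the claim that each item lands in exactly one dummy column, the sketch below rebuilds the example and sums the dummy columns per row. The name `dummy_cols` is only illustrative, and `dtype=int` is passed on the assumption of a recent pandas, where the default dummy dtype is boolean.

```python
import pandas as pd

df = pd.DataFrame({'Languages': ['R', 'Python', 'C#', 'PHP', 'Java', 'XSLT', 'SQL'],
                   'ProductCategory': ['Statistical', 'General Purpose',
                                       'General Purpose', 'Web', 'General Purpose',
                                       'Special Purpose', 'Special Purpose']})

newdf = pd.get_dummies(df, columns=['ProductCategory'], prefix=['Categ'], dtype=int)

# Collect the generated dummy columns by their shared prefix (illustrative helper name).
dummy_cols = [c for c in newdf.columns if c.startswith('Categ_')]

# With one category per row, every row should have exactly one dummy set to 1.
row_sums = newdf[dummy_cols].sum(axis=1)
print(row_sums.eq(1).all())
```

When a row can carry several categories at once, as in the question, this check no longer holds, which is why get_dummies() cannot be applied to a list-valued column directly.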

score 2 · Accepted Answer

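I assume your problem is that each row can have several dummies set, so that "Product Category" is a column whose values are lists of categories. Perhaps this would work, although I'm not sure how memory-efficient it is.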

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"Product Category": [['Category1', 'Category2'],
   ...:                                         ['Category3'],
   ...:                                         ['Category1', 'Category4'],
   ...:                                         ['Category1', 'Category3', 'Category5']]})

In [3]: df
Out[3]:
                    Product Category
0             [Category1, Category2]
1                        [Category3]
2             [Category1, Category4]
3  [Category1, Category3, Category5]

In [4]: def list_to_dict(category_list):
   ...:         n_categories = len(category_list)
   ...:         return dict(zip(category_list, [1]*n_categories))
   ...:

In [5]: df_dummies = pd.DataFrame(list(df['Product Category'].apply(list_to_dict).values)).fillna(0).astype(int)

In [6]: df_new = df.join(df_dummies)

In [7]: df_new
Out[7]:
                    Product Category  Category1  Category2  Category3  Category4  Category5
0             [Category1, Category2]          1          1          0          0          0
1                        [Category3]          0          0          1          0          0
2             [Category1, Category4]          1          0          0          1          0
3  [Category1, Category3, Category5]          1          0          1          0          1
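
An alternative that stays within pandas built-ins, offered only as a sketch and not as the answerer's method: explode the lists into one row per category, let get_dummies() encode them, then collapse back to one row per original index. It assumes pandas 0.25 or later (for `Series.explode`) and a unique index; the names `exploded` and `dummies` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Product Category": [['Category1', 'Category2'],
                                        ['Category3'],
                                        ['Category1', 'Category4'],
                                        ['Category1', 'Category3', 'Category5']]})

# One row per (original index, category); the original index is repeated.
exploded = df["Product Category"].explode()

# Encode each category, then take the max per original index so that
# a row gets a 1 for every category its list contained.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

df_new = df.join(dummies)
print(df_new)
```

No Python-level loop over rows is involved, so this may scale better to 100,000+ rows, though that is worth timing on the real data.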
