我试图使用 sklearn 中的 LabelEncoder 和 OneHotEncoder 对数据集进行 1-hot-encoding,方法是首先对每列进行 LabelEncoding,然后在列上执行 OneHotEncoding。注意:我故意将两列的数据帧的第 1 行设为 nan,这样 LabelEncoder 就不会丢失。
这是代码:
training_data.dropna(axis=1,how='any',inplace=True)
print training_data.shape
rows = [1]
training_data.loc[rows, endocing_columns] = float("nan")
print training_data.loc[1].mail_category
print training_data.loc[1].mail_type
for col in endocing_columns:
label_encoder=LabelEncoder()
oneHot_encoder=OneHotEncoder(sparse=False)
label_encoder.fit(training_data[col])
temp_col = pd.DataFrame(label_encoder.transform(training_data[col]))
oneHot_encoder.fit(temp_col)
temp = oneHot_encoder.transform(temp_col)
print training_data.shape
temp=pd.DataFrame(temp)
training_data[col].value_counts().index])
# In side by side concatenation index values should be same
# Setting the index values similar to the training_data data frame
temp=temp.set_index(training_data.index.values)
# adding the new One Hot Encoded varibales to the train data frame
training_data=pd.concat([training_data,temp],axis=1)
training_data.drop(col, axis=1, inplace=True)
print label_encoder.classes_
temp_col = pd.DataFrame(label_encoder.transform(test_data[col]))
temp = oneHot_encoder.transform(temp_col)
这是代码的输出(请注意,在标签编码器的打印类中,有 nan):
(478192, 46)
nan
nan
(478192, 46)
[nan 'mail_category_1' 'mail_category_10' 'mail_category_11'
'mail_category_12' 'mail_category_13' 'mail_category_14'
'mail_category_15' 'mail_category_16' 'mail_category_17'
'mail_category_18' 'mail_category_2' 'mail_category_3' 'mail_category_4'
'mail_category_5' 'mail_category_6' 'mail_category_7' 'mail_category_8'
'mail_category_9']
Traceback (most recent call last):
File "basic_analysis.py", line 46, in <module>
temp_col = pd.DataFrame(label_encoder.transform(test_data[col]))
File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/label.py", line 148, in transform
raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: [nan]