python - Why doesn't LabelEncoder read the values?

Question

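I am trying to one-hot encode a data set using LabelEncoder and OneHotEncoder from sklearn, by first applying LabelEncoder to each column and then applying OneHotEncoder to the result. Note: I deliberately set row 1 of the data frame to nan in both columns so that the LabelEncoder would not miss it.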

Here is the code:

training_data.dropna(axis=1,how='any',inplace=True)
print training_data.shape
rows = [1]
training_data.loc[rows, endocing_columns] = float("nan")


print training_data.loc[1].mail_category 
print training_data.loc[1].mail_type 
for col in endocing_columns:
    label_encoder=LabelEncoder()
    oneHot_encoder=OneHotEncoder(sparse=False)
    label_encoder.fit(training_data[col])
    temp_col = pd.DataFrame(label_encoder.transform(training_data[col]))

    oneHot_encoder.fit(temp_col)
    temp = oneHot_encoder.transform(temp_col)
    print training_data.shape
    temp=pd.DataFrame(temp)
    # NOTE: the next line is a truncated fragment in the original post; it is
    # left commented out so that the loop body parses.
    # training_data[col].value_counts().index])
    # For side-by-side concatenation the index values must match,
    # so give temp the same index as the training_data data frame.
    temp=temp.set_index(training_data.index.values)
    # Add the new one-hot encoded variables to the training data frame.
    training_data=pd.concat([training_data,temp],axis=1)
    training_data.drop(col, axis=1, inplace=True)

    print label_encoder.classes_
    temp_col = pd.DataFrame(label_encoder.transform(test_data[col]))
    temp = oneHot_encoder.transform(temp_col)

Here is the output of the code (note that nan appears among the label encoder's printed classes):

(478192, 46)
nan
nan
(478192, 46)
[nan 'mail_category_1' 'mail_category_10' 'mail_category_11'
 'mail_category_12' 'mail_category_13' 'mail_category_14'
 'mail_category_15' 'mail_category_16' 'mail_category_17'
 'mail_category_18' 'mail_category_2' 'mail_category_3' 'mail_category_4'
 'mail_category_5' 'mail_category_6' 'mail_category_7' 'mail_category_8'
 'mail_category_9']
Traceback (most recent call last):
  File "basic_analysis.py", line 46, in <module>
    temp_col = pd.DataFrame(label_encoder.transform(test_data[col]))
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/label.py", line 148, in transform
    raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: [nan]
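
One plausible reading of the traceback above is that NaN never compares equal to itself, so a nan in test_data may not match the nan stored in classes_ during fit. The sketch below is a plain-Python stand-in for the encoder, not sklearn's implementation; the helper fill_missing, the placeholder "missing", and the sample values are hypothetical and chosen only for illustration. It replaces NaN with a string placeholder before building the label mapping, so training and test data share an ordinary, comparable label.

```python
import math

# Two separately created NaN values do not compare equal.
nan_a = float("nan")
nan_b = float("nan")
nan_never_equal = (nan_a == nan_b)  # False

def fill_missing(values, placeholder="missing"):
    """Replace float NaN entries with a string placeholder (hypothetical helper)."""
    return [
        placeholder if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]

# Illustrative stand-ins for one column of training_data and test_data.
train = ["mail_category_1", float("nan"), "mail_category_2"]
test = ["mail_category_2", float("nan")]

# "Fit": collect the sorted set of labels seen in training, placeholder included.
classes = sorted(set(fill_missing(train)))
index_of = {label: i for i, label in enumerate(classes)}

# "Transform": every test label, including the placeholder, is now a known class.
encoded_test = [index_of[v] for v in fill_missing(test)]
```

With pandas, calling fillna with a placeholder string on the column in both training_data and test_data before fitting would play the same role as fill_missing here.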
