
Given the way decision trees and random forests work with split logic, my impression was that label encoding should not be a problem for these models, since we are going to split on the column anyway. For example: if we have sex as "male", "female" and "other", label encoding turns it into 0, 1, 2, interpreted as 0 < 1 < 2. But since we split the column, I assumed this would not matter, because splitting on "male" or on "0" is the same thing. However, when I tried both label and one-hot encoding on my dataset, one-hot encoding gave better accuracy and precision. Could you share your thoughts?

The ACCURACY SCORE of various models on train and test are:

The accuracy score of simple decision tree on label encoded data :    TRAIN: 86.46%     TEST: 79.42%
The accuracy score of tuned decision tree on label encoded data :     TRAIN: 81.74%     TEST: 81.33%
The accuracy score of random forest ensemble on label encoded data:   TRAIN: 82.26%     TEST: 81.63%
The accuracy score of simple decision tree on one hot encoded data :  TRAIN: 86.46%     TEST: 79.74%
The accuracy score of tuned decision tree on one hot encoded data :   TRAIN: 82.04%     TEST: 81.46%
The accuracy score of random forest ensemble on one hot encoded data: TRAIN: 82.41%     TEST: 81.66%

The PRECISION SCORE of various models on train and test are:

The precision score of simple decision tree on label encoded data :             TRAIN: 78.26%   TEST: 57.92%
The precision score of tuned decision tree on label encoded data :              TRAIN: 66.54%   TEST: 64.6%
The precision score of random forest ensemble on label encoded data:            TRAIN: 70.1%    TEST: 67.44%
The precision score of simple decision tree on one hot encoded data :           TRAIN: 78.26%   TEST: 58.84%
The precision score of tuned decision tree on one hot encoded data :            TRAIN: 68.06%   TEST: 65.81%
The precision score of random forest ensemble on one hot encoded data:          TRAIN: 70.34%   TEST: 67.32%
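For reference, a minimal sketch of the two encodings compared above, using pandas and scikit-learn; the column and category values follow the sex example from the question, but the data itself is illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"sex": ["male", "female", "other", "female"]})

# Label encoding: a single ordinal column, which implies an order 0 < 1 < 2.
le = LabelEncoder()
df["sex_label"] = le.fit_transform(df["sex"])  # classes sorted alphabetically

# One-hot encoding: one binary column per category, no implied order.
one_hot = pd.get_dummies(df["sex"], prefix="sex")
print(df.join(one_hot))
```

Note that `LabelEncoder` assigns codes by alphabetical order of the classes, so here female=0, male=1, other=2.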





1 Answer


You can think of it as a regularization effect: your model is simpler and therefore more general, so you get better performance.

Take your sex feature as an example: label encoding turns [male, female, other] into [0, 1, 2].

Now suppose there is a particular configuration of the other features that applies only to females: the tree needs two splits to select females, one for sex greater than zero and another for sex less than 2.

With one-hot encoding, by contrast, you only need one split to make the selection, e.g. sex_female greater than zero.
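This can be checked on a toy dataset (a hypothetical sketch, not the asker's data): the target is 1 exactly when sex is "female", which under label encoding sits between male (0) and other (2), so the tree must carve out the middle value with two threshold splits, while the one-hot tree isolates it in one:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Label-encoded sex: male=0, female=1, other=2; target = (sex == female).
X_label = np.array([[0], [1], [2], [0], [1], [2]])
y = np.array([0, 1, 0, 0, 1, 0])
tree_label = DecisionTreeClassifier(random_state=0).fit(X_label, y)

# One-hot columns: [sex_female, sex_male, sex_other], same rows as above.
X_onehot = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1],
                     [0, 1, 0], [1, 0, 0], [0, 0, 1]])
tree_onehot = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)

# Label encoding needs two splits (sex > 0.5 and sex <= 1.5) to reach
# the middle value; one-hot needs a single split on sex_female.
print(tree_label.get_depth())   # 2
print(tree_onehot.get_depth())  # 1
```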

Answered 2020-06-06T20:48:34.077