python - Decision tree implementation that returns the next feature to split the tree on

Question

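Suppose my data consists of fruits, each described by its color, its shape and further features. I want to return at most X fruits that have the features the user states, and I want to do it with as few questions as possible.

The first questions I always ask the user are the fruit's color and shape. Depending on the answers, I want to ask for K more features, such as texture, size, peel type and so on. I want K to be the smallest number that still returns the most accurate X results, so I need to know which feature I should ask the user about next. My database consists of fruits classified by features (with arbitrary values).

Is this a machine learning problem? Which algorithm should I use, and which implementation? I have looked in scikit-learn, NLTK and Weka for a suitable algorithm. Either these algorithms are not suited to this problem, or I need more specific guidance on how to use them.

Thanks!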

score 1 · Accepted Answer

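Yes.

A decision tree projects the points onto each feature and finds the best split. The split can be chosen with different metrics, for example the Gini index or entropy (information gain). scikit-learn provides this in sklearn.tree.

Suppose you have 5 data points: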

 color   shape   fruit
 orange  oblong  orange
 red     round   apple
 orange  round   orange
 red     oblong  apple
 red     round   apple

So to train, you would do something like this:

feature   class  |  feature  class
orange    orange |  oblong   orange
red       apple  |  round    apple
orange    orange |  round    orange
red       apple  |  oblong   apple
red       apple  |  round    apple

As you can see, the best split is on color, because for this dataset color = red implies fruit = apple and color = orange implies fruit = orange.
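To make "best split" concrete, here is a minimal sketch that computes the information gain of each feature on the five data points above. It uses only the standard library; the helper names `entropy` and `information_gain` are my own, not part of scikit-learn.

```python
from collections import Counter
from math import log2

# The five data points from the table above: (color, shape, fruit)
rows = [
    ("orange", "oblong", "orange"),
    ("red", "round", "apple"),
    ("orange", "round", "orange"),
    ("red", "oblong", "apple"),
    ("red", "round", "apple"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature_index):
    """Entropy of the fruit label minus the weighted entropy after the split."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        # Group the fruit labels by the value of the chosen feature
        groups.setdefault(row[feature_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gain_color = information_gain(rows, 0)
gain_shape = information_gain(rows, 1)
print(f"color: {gain_color:.3f}, shape: {gain_shape:.3f}")
```

Splitting on color removes all uncertainty about the fruit (a gain of about 0.97 bits), while splitting on shape gains only about 0.02 bits, which is why color ends up at the root.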

Training on these data points gives you the decision tree:

        color
___________________
|                 |
|                 |
red               orange
apple             orange

In real life, these splits would be based on numeric values, such as num > 0.52.
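As a rough sketch of how such a numeric threshold could be found, the code below tries the midpoint between each pair of neighbouring values and keeps the one with the lowest weighted Gini impurity. The "roundness" measurements are invented for illustration, and the helper names are assumptions of mine. Library implementations are more elaborate than this.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value > threshold`."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        # Candidate threshold: midpoint between two neighbouring values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical "roundness" measurements, invented for illustration only
roundness = [0.30, 0.45, 0.60, 0.75, 0.90]
fruit = ["orange", "orange", "apple", "apple", "apple"]
threshold, score = best_threshold(roundness, fruit)
print(threshold, score)
```

With this made-up data the chosen threshold falls between 0.45 and 0.60, where both sides of the split contain a single class.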

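As for which algorithm to use, it depends. You will have to experiment with your own data, because the right choice varies with the dataset and with your preferences.

You can use scikit-learn on the example above like this: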

from sklearn.tree import DecisionTreeClassifier

# Build the sample matrix: one row per fruit, columns are [color, shape]
samples = [[1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]
# Build the target vector (in this case the fruit): 1 = orange, 0 = apple
fruitname = [1, 0, 1, 0, 0]
# Create and fit the model
dtree = DecisionTreeClassifier()
dtree = dtree.fit(samples, fruitname)
# Classify an unknown red fruit that is oblong (predict expects a 2D array)
dtree.predict([[0, 1]])

Note that color=1 means the fruit is orange, and shape=1 means the fruit is oblong.
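For reference, this is one way the table could be turned into that numeric encoding by hand. The dictionary names are my own, and mapping the fruit "orange" to 1 is inferred from the target vector rather than stated in the note.

```python
# The five rows from the table above: (color, shape, fruit)
rows = [
    ("orange", "oblong", "orange"),
    ("red", "round", "apple"),
    ("orange", "round", "orange"),
    ("red", "oblong", "apple"),
    ("red", "round", "apple"),
]

# Encoding from the note: color=1 is orange, shape=1 is oblong
color_code = {"red": 0, "orange": 1}
shape_code = {"round": 0, "oblong": 1}
# Assumption inferred from the target vector: 1 = orange, 0 = apple
fruit_code = {"apple": 0, "orange": 1}

samples = [[color_code[c], shape_code[s]] for c, s, _ in rows]
fruitname = [fruit_code[f] for _, _, f in rows]
print(samples)
print(fruitname)
```

Features with more than two arbitrary values need more care, because a plain integer code imposes an ordering that the tree will treat as meaningful.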

See the scikit-learn user guide for a more in-depth overview.

score 0 · Accepted Answer

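Yes, this is a machine learning problem (to some extent). I suggest the decision tree approach, for which there are many different algorithms. ID3 and C4.5 are simple algorithms that help you minimize the depth, because they choose the next question (the next feature to split the tree on) by maximum information gain.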
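Applied to the question, the ID3-style selection step might look like the sketch below: filter the catalogue by the answers given so far, then ask about the remaining feature with the highest information gain over the candidate fruits. The catalogue, the feature names and the `next_feature` helper are all hypothetical. This is not a full ID3 or C4.5 implementation.

```python
from collections import Counter
from math import log2

# Hypothetical catalogue, invented for illustration only
fruits = [
    {"name": "apple", "color": "red", "shape": "round",
     "texture": "smooth", "peel": "thin", "size": "medium"},
    {"name": "cherry", "color": "red", "shape": "round",
     "texture": "smooth", "peel": "thin", "size": "small"},
    {"name": "pomegranate", "color": "red", "shape": "round",
     "texture": "smooth", "peel": "thick", "size": "large"},
    {"name": "lychee", "color": "red", "shape": "round",
     "texture": "rough", "peel": "thick", "size": "small"},
    {"name": "banana", "color": "yellow", "shape": "oblong",
     "texture": "smooth", "peel": "thick", "size": "medium"},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def next_feature(candidates, asked):
    """Pick the not-yet-asked feature with the highest information gain."""
    names = [c["name"] for c in candidates]
    features = [f for f in candidates[0] if f != "name" and f not in asked]

    def gain(feature):
        # Group candidate names by their value for this feature
        groups = {}
        for c in candidates:
            groups.setdefault(c[feature], []).append(c["name"])
        remainder = sum(len(g) / len(candidates) * entropy(g)
                        for g in groups.values())
        return entropy(names) - remainder

    return max(features, key=gain)

# The user has already answered the two fixed opening questions
answers = {"color": "red", "shape": "round"}
candidates = [f for f in fruits if all(f[k] == v for k, v in answers.items())]
question = next_feature(candidates, asked=set(answers))
print(question)
```

In an interactive loop you would record the answer, filter the candidates again, and stop once at most X fruits remain. The number of rounds it took is then K.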
