
I'm using the Scikit-Learn Random Forest Classifier and trying to extract the meaningful trees/features in order to better understand the prediction results.

I found this method in the documentation that seems relevant (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), but couldn't find an example of how to use it.

I'm also hoping to visualize those trees if possible; any relevant code would be great.

Thanks!


3 Answers


I think you're looking for Forest.feature_importances_. This allows you to see the relative importance of each input feature to the final model. Here's a simple example.

import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier 


# Let's set up a training dataset.  We'll make 100 entries, each with 19 features,
# and each row classified as either 0 or 1.  We'll artificially set the first 3
# features of rows classified as "1" to fixed values, so that we know these are
# the "important" features.  If we do it right, the model should point out these
# three as important.  The rest of the features will just be noise.
train_data = []  # must be all floats
for _ in range(100):
    line = []
    if random.random()>0.5:
        line.append(1.0)
        #Let's add 3 features that we know indicate a row classified as "1".
        line.append(.77)
        line.append(.33)
        line.append(.55)
        for _ in range(16):  # fill in the rest with noise
            line.append(random.random())
    else:
        #this is a "0" row, so fill it with noise.
        line.append(0.0)
        for _ in range(19):
            line.append(random.random())
    train_data.append(line)
train_data = np.array(train_data)


# Create the random forest object which will include all the parameters
# for the fit.  (In current scikit-learn, feature importances are always
# computed; the old compute_importances flag has been removed.)
Forest = RandomForestClassifier(n_estimators=100)

# Fit the training data to the training output and create the decision
# trees.  The slicing below treats the first column of our data as the
# classification, and the rest of the columns as the features.
Forest = Forest.fit(train_data[:, 1:], train_data[:, 0])

# Now you can see the importance of each feature in Forest.feature_importances_.
# These values all add up to one.  Let's call the "important" ones those that
# are above average.
important_features = []
for idx, importance in enumerate(Forest.feature_importances_):
    if importance > np.average(Forest.feature_importances_):
        important_features.append(str(idx))
print('Most important features:', ', '.join(important_features))
#we see that the model correctly detected that the first three features are the most important, just as we expected!
answered 2013-06-21T16:16:55.027

To get the relative feature importances, read the relevant section of the documentation, along with the code of the linked example in that same section.
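
As a rough sketch of what that documentation example does (assuming a fitted forest named forest), you can also look at the spread of each importance across the individual trees:

import numpy as np

# Mean importance per feature, plus its standard deviation across
# the individual trees of the ensemble.
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

# Print the features sorted from most to least important.
for idx in np.argsort(importances)[::-1]:
    print("feature %d: %.3f (+/- %.3f)" % (idx, importances[idx], std[idx]))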

The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the fit method has been called). Now, to extract the "key trees", you first need to define what that means and what you expect to do with them.
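
For example, each entry of estimators_ is an ordinary decision tree, so you can export any one of them for visualization. A minimal sketch (assuming a fitted forest named forest, and the Graphviz dot tool for rendering):

from sklearn.tree import export_graphviz

# Export the first tree of the ensemble to Graphviz format; render it
# afterwards with e.g. `dot -Tpng tree_0.dot -o tree_0.png`.
export_graphviz(forest.estimators_[0], out_file='tree_0.dot')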

You could rank the individual trees by computing their score on a held-out test set, but I don't know what you'd expect to get out of that.
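
If you did want such a ranking, one way (a sketch, assuming a fitted forest named forest and a held-out split X_test, y_test) is to score each tree individually:

# Score every individual tree on the held-out set and sort best-first.
tree_scores = [(i, tree.score(X_test, y_test))
               for i, tree in enumerate(forest.estimators_)]
tree_scores.sort(key=lambda pair: pair[1], reverse=True)
print(tree_scores[:5])  # the five best-scoring trees

Keep in mind that each tree is trained on a bootstrap sample, so a single tree's score says little about the ensemble as a whole.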

Do you want to prune the forest to make predictions faster, by reducing the number of trees without decreasing the aggregate forest accuracy?

answered 2013-06-12T12:23:37.367

Here's how I visualize the trees:

After doing all the preprocessing, splitting, etc., first make the model:

# n_estimators is the number of trees in the forest (here, 100)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

Make predictions:

# Predicting the Test set results
y_pred = classifier.predict(X_test)
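
To sanity-check those predictions, you could, for example, compare them against the true labels with accuracy_score:

from sklearn.metrics import accuracy_score

# Fraction of test samples classified correctly.
print(accuracy_score(y_test, y_pred))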

Then make the importance plot. The variable dataset is the name of the original data frame.

import numpy as np
import matplotlib.pyplot as plt

# get importances from RF
importances = classifier.feature_importances_

# np.argsort sorts ascending, which puts the largest bars at the
# top of the horizontal chart below
indices = np.argsort(importances)

# get the feature names from the original data set
# (here the first 26 columns of `dataset` are the features)
features = dataset.columns[0:26]

# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

This produces a plot like the one below:

[Image: horizontal bar chart of relative feature importances]

answered 2017-06-07T17:52:43.747