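According to the Orange documentation, a rule's class_distribution attribute is the "distribution of class in data instances covered by this rule". However, when I apply the rules to the data instances of the same dataset the rules were derived from, the number of instances that fire a rule r sometimes differs from the counts in r.class_distribution.
For example, using the adult_sample dataset that ships with the Orange package and the following code: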
import numpy as np
import Orange
#raw string so the backslashes in the Windows path are taken literally
data = Orange.data.Table(r"C:\Python27\Lib\site-packages\Orange\datasets/adult_sample")
cn2_learner = Orange.classification.rules.CN2UnorderedLearner()
#only want to learn rules for class0:
cn2_learner.target_class = 0
cn2_classifier = Orange.classification.rules.RuleLearner.__call__(cn2_learner, data, 0)
RS = cn2_classifier.rules #rule set
#Find what rules fire for each data instance
rulesFired=[[r(d) for r in RS] for d in data]
classV = np.array([d.get_class()==data.domain.class_var.values[1] for d in data]).astype(int)
ind0 = np.where(classV==0)[0] #indices of data with class 0
ind1 = np.where(classV==1)[0] #indices of data with class 1
rulesFired0=np.delete(rulesFired, ind1,0) #indicates what rules fired for each class 0 instance
rulesFired1=np.delete(rulesFired, ind0,0) #indicates what rules fired for each class 1 instance
ruleFreq0 = np.sum(rulesFired0,axis=0) #how many class0 instances fired for each rule
ruleFreq1 = np.sum(rulesFired1,axis=0) #how many class1 instances fired for each rule
#Check to see if instances that fired rules match up with r.class_distribution
for ind in range(len(RS)):
    r=RS[ind]
    if r.class_distribution[0] != ruleFreq0[ind] or r.class_distribution[1] != ruleFreq1[ind]:
        print ind #print indices of rules with mismatches
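The per-class counting in the script above can be checked in isolation, without Orange. The following is only a minimal sketch on a made-up fire matrix: the array values, the labels, and the helper name per_class_fire_counts are all hypothetical and are not taken from the adult_sample data.

```python
import numpy as np

# Hypothetical toy data: rows are instances, columns are rules.
# fired[i, j] is True when rule j fires on instance i.
fired = np.array([[True,  False],
                  [True,  True],
                  [False, True],
                  [True,  False]])
labels = np.array([0, 0, 1, 1])  # made-up class index of each instance

def per_class_fire_counts(fired, labels, n_classes):
    """Return counts[c, j] = number of class-c instances on which rule j fires."""
    fired = np.asarray(fired, dtype=bool)
    labels = np.asarray(labels)
    # Select the rows of each class with a boolean mask, then sum per rule.
    return np.array([fired[labels == c].sum(axis=0) for c in range(n_classes)])

counts = per_class_fire_counts(fired, labels, 2)
# counts[0] plays the role of ruleFreq0, counts[1] the role of ruleFreq1.
```

Selecting rows with a boolean mask gives the same result as deleting the rows of the other class with np.delete, so this can serve as a cross-check on ruleFreq0 and ruleFreq1.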
For 32 of the 82 rules, rule.class_distribution does not match ruleFreq as defined above.
Take RS[5] as an example:
#IF education=['Prof-school'] AND age>31.0 THEN y=>50K<3.000, 0.000>
RS[5].class_distribution = <3.000, 0.000> .
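According to this, 3 instances of class 0 fired the rule, but ruleFreq0[5] = 7, meaning that 7 instances of class 0 fired the rule when I ran it over all the data. Those 7 instances are indexed by ind0[np.where(rulesFired0[:,5])[0]]. Some examples are: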
#data[220]: [43.000000, 'Private', 350661.000000, 'Prof-school', 15.000000, 'Separated', 'Tech-support', 'Not-in-family', 'White', 'Male', 0.000000, 0.000000, 50.000000, 'Columbia', '>50K']
#data[240]: [43.000000, 'State-gov', 33331.000000, 'Prof-school', 15.000000, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'White', 'Male', 0.000000, 1977.000000, 70.000000, 'United-States', '>50K']
#data[372]: [41.000000, 'Private', 130126.000000, 'Prof-school', 15.000000, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'White', 'Male', 0.000000, 0.000000, 80.000000, 'United-States', '>50K']
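The index expression ind0[np.where(rulesFired0[:,5])[0]] maps row positions within the class-0 submatrix back to positions in the full table. A small sketch of that mapping, using made-up values rather than the real data (the arrays and the helper name instances_firing are hypothetical):

```python
import numpy as np

# Hypothetical fire matrix (5 instances, 2 rules) and class indices.
fired = np.array([[True,  False],
                  [False, True],
                  [True,  True],
                  [True,  False],
                  [False, False]])
class_v = np.array([0, 1, 0, 1, 0])

def instances_firing(fired, class_v, rule, cls):
    """Indices (into the full table) of class-`cls` instances that fire `rule`."""
    ind_cls = np.where(class_v == cls)[0]      # positions of this class in the full table
    fired_cls = fired[ind_cls]                 # fire matrix restricted to this class
    local = np.where(fired_cls[:, rule])[0]    # row positions within the restricted matrix
    return ind_cls[local]                      # map back to full-table indices

hits = instances_firing(fired, class_v, rule=0, cls=0)

# The same indices, computed directly on the full table as a cross-check.
direct = np.flatnonzero(fired[:, 0] & (class_v == 0))
```

If the two results agree on the real data as well, the indexing in the script is not the source of the mismatch.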
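Finally, here are my questions:
Is this a bug in the Orange code, or does the class_distribution attribute specify something other than the number of instances of each class (from the entire dataset used to learn the rules) that fire the rule?
Is class_distribution used to compute the quality of the rule? If so, an error in computing class_distribution would carry over into the rule quality computation.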
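To illustrate why the second question matters, here is a sketch of how a count-based quality measure would react to the two different counts. It assumes, purely for illustration, a Laplace-style accuracy estimate; nothing in this post confirms which evaluator Orange actually applies here. The function name laplace_quality is hypothetical, and the second distribution <7, 0> is also hypothetical, since only the class-0 count of 7 is known.

```python
def laplace_quality(n_target, n_total, n_classes):
    """Laplace-style accuracy estimate: (n_target + 1) / (n_total + n_classes)."""
    # 1.0 forces float division under Python 2 as well as Python 3.
    return (n_target + 1.0) / (n_total + n_classes)

# Distribution stored on the rule: <3, 0>
q_stored = laplace_quality(n_target=3, n_total=3, n_classes=2)

# Hypothetical distribution if all 7 class-0 instances were counted: <7, 0>
q_recount = laplace_quality(n_target=7, n_total=7, n_classes=2)
```

Under this assumed measure the two distributions give different scores, so any discrepancy in class_distribution would indeed change the computed quality.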