r - How the OneR algorithm works in R

Question

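I used the OneR algorithm from the FSelector package to find the attribute with the lowest error rate. My class attribute takes the values yes and no. My attributes also take the values yes and no.

The result of the OneR algorithm is: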

Ranking No. 1

Attribute name: OR1

Matrix:       Attribute = 0   Attribute = 1
Class = 0             25243               0
Class = 1              1459              18

Error rate: 1459 (0 + 1459)

Ranking No. 2

Attribute name: OR2

Matrix:       Attribute = 0   Attribute = 1
Class = 0             25243               0
Class = 1              1460              17

Error rate: 1460 (0 + 1460)

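However, if I use the correlation function on the same data frame, the best attribute it finds has a lower error rate than the attributes returned by the oneR function.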

Attribute name: CO4

Matrix:       Attribute = 0   Attribute = 1
Class = 0             25204              39
Class = 1              1348             129

Error rate: 1387 (39 + 1348)

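Can anyone tell me why the OneR algorithm does not rank the CO4 attribute as the best attribute (based on the error rate)?

Which criterion does the OneR algorithm use?

--- Addition, to make my question easier to understand ---

The complete data set is too large to show. I built a new, small data pool that shows the same effect: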

DELAYED   OR1   CO4
      1     1     1
      0     0     0
      0     0     1
      1     0     1
      0     0     0
      1     0     1
      0     0     0
      1     0     1
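For anyone who wants to reproduce the problem, the sample above can be rebuilt as a data frame. This is a minimal sketch that assumes the columns are stored as plain numeric 0/1 values, which the question does not state; the name `datapool_stackoverflow` is taken from the code further down.

```r
# Rebuild the eight-row sample (values copied from the table above).
# Assumption: all columns are numeric 0/1, not factors.
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Inspect the column types before handing the data to a feature selector.
str(datapool_stackoverflow)
```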

Code to show the error rate of a single attribute:

print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))

Code for the OneR function:

library(FSelector)

oneR_stackoverflow <- oneR(DELAYED~., datapool_stackoverflow)

subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)

print(subset_stackoverflow)

Code for the correlation:

cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))

In this case, the results are:

Error rate for OR1

Matrix:       Attribute = 0   Attribute = 1
Class = 0                 4               0
Class = 1                 3               1

Manually calculated error rate: 3 (0 + 3)

Error rate for CO4

Matrix:       Attribute = 0   Attribute = 1
Class = 0                 3               1
Class = 1                 0               4

Error rate: 1 (1 + 0)
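The manual error rates above can be checked in base R. The sketch below predicts the most frequent class for each attribute value and counts the rows that fall outside that majority; `one_rule_errors` is a hypothetical helper written for this illustration, not a function from FSelector.

```r
DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
OR1     <- c(1, 0, 0, 0, 0, 0, 0, 0)
CO4     <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Hypothetical helper: number of rows misclassified by a one-attribute rule
# that predicts the majority class for every attribute value.
one_rule_errors <- function(attribute, class) {
  tab <- table(class, attribute)        # rows: class, columns: attribute value
  sum(tab) - sum(apply(tab, 2, max))    # all rows minus correctly classified rows
}

errors <- c(OR1 = one_rule_errors(OR1, DELAYED),
            CO4 = one_rule_errors(CO4, DELAYED))
print(errors)
```

On this sample the helper gives 3 errors for OR1 and 1 error for CO4, which matches the manual counts.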

Correlation: attribute OR1: 0.377, attribute CO4: 0.77

OneR: "OR1", "CO4"

Why does the OneR function return OR1 as the best attribute for classification?
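The correlation figures can be reproduced on the eight-row sample with base R alone. This sketch assumes numeric 0/1 vectors; if the columns were factors, `as.numeric()` would return the level codes 1 and 2 instead, which shifts the values but leaves the Pearson correlation unchanged.

```r
DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
OR1     <- c(1, 0, 0, 0, 0, 0, 0, 0)
CO4     <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Pearson correlation between the class and each attribute.
r_or1 <- cor(DELAYED, OR1)
r_co4 <- cor(DELAYED, CO4)
print(round(c(OR1 = r_or1, CO4 = r_co4), 3))
```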

score 0 · Accepted Answer

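You did not say what type your data has, but I assume the values are numeric. FSelector discretizes numeric values before using them in oneR, and it seems something goes wrong at that step (this might be a bug in RWeka's Discretize function). However, since you only have 0-1 values, you probably want factor variables rather than numeric data anyway. With factors, everything works fine for me: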

> df = data.frame(delayed=factor(c(1,0,0,1,0,1,0,1)), or1 = factor(c(1,0,0,0,0,0,0,0)), co4 = factor(c(1,0,1,1,0,1,0,1)))
> library(FSelector)
> oneR(delayed~., df)
    attr_importance
or1       0.2000000
co4       0.4285714

As you can see, co4 now has a higher importance than or1, as it should.
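If the data frame already exists with numeric 0/1 columns, the columns can be converted to factors in one step. This is a minimal sketch in base R; the data frame name `datapool` is chosen for illustration, and the FSelector call is left as a comment because it needs the package to be installed.

```r
# Sample data stored as numeric 0/1 values (assumed starting point).
datapool <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Convert every column to a factor; `datapool[] <-` keeps the data frame shape.
datapool[] <- lapply(datapool, factor)
str(datapool)

# With FSelector installed, the ranking could then be requested as:
# FSelector::oneR(DELAYED ~ ., datapool)
```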

score 0 · Accepted Answer

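OK, I have the solution. The algorithm computes, for each attribute, the sum of the error rates of its values, where each error rate is taken relative to the number of rows that have that value.

In this example:

Attribute OR1: 3/7 + 0/1 = 3/7

Attribute CO4: 0/3 + 1/5 = 0.2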
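The arithmetic in this answer can be reproduced in base R. The sketch below follows the description given here, not the FSelector source; `per_value_error_sum` is a hypothetical helper. The final step, which flips the scores with max + min - score, is an assumption: it happens to reproduce the importances 0.2 and 0.4285714 printed in the first answer, but whether FSelector computes them exactly this way is not confirmed in this thread.

```r
DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
OR1     <- c(1, 0, 0, 0, 0, 0, 0, 0)
CO4     <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Hypothetical helper: for each attribute value, divide the misclassified
# rows by the number of rows with that value, then add the ratios up.
per_value_error_sum <- function(attribute, class) {
  tab      <- table(class, attribute)
  totals   <- colSums(tab)                   # rows per attribute value
  minority <- totals - apply(tab, 2, max)    # misclassified rows per value
  sum(minority / totals)
}

scores <- c(OR1 = per_value_error_sum(OR1, DELAYED),
            CO4 = per_value_error_sum(CO4, DELAYED))
print(scores)    # error sums: lower is better

# Assumption: turn the error sums into importances by flipping them.
flipped <- max(scores) + min(scores) - scores
print(flipped)   # importances: higher is better
```

Under this reading CO4 is still the better attribute, because it has the smaller error sum (0.2 against 3/7).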

score 0 · Accepted Answer

No, CO4 should be chosen; choosing the other attribute is wrong. See what the OneR package (available on CRAN) gives:

> library(OneR)
> DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
> OR1 <- c(1, rep(0, 7))
> CO4 <- c(1, 0, 1, 1, 0, 1, 0, 1)
> 
> data <- data.frame(DELAYED, OR1, CO4)
> 
> model <- OneR(formula = DELAYED ~., data = data, verbose = T)

    Attribute Accuracy
1 * CO4       87.5%   
2   OR1       62.5%   
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'

> summary(model)

Rules:
If CO4 = 0 then DELAYED = 0
If CO4 = 1 then DELAYED = 1

Accuracy:
7 of 8 instances classified correctly (87.5%)

Contingency table:
       CO4
DELAYED   0   1 Sum
    0   * 3   1   4
    1     0 * 4   4
    Sum   3   5   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 2.1333, df = 1, p-value = 0.1441

> 
> model_2 <- OneR(formula = DELAYED ~ OR1, data = data)
> summary(model_2)

Rules:
If OR1 = 0 then DELAYED = 0
If OR1 = 1 then DELAYED = 1

Accuracy:
5 of 8 instances classified correctly (62.5%)

Contingency table:
       OR1
DELAYED   0   1 Sum
    0   * 4   0   4
    1     3 * 1   4
    Sum   7   1   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 0, df = 1, p-value = 1

You can find more information about the OneR package here: https://github.com/vonjd/OneR

(Full disclosure: I am the author of this package.)
