I want to compute pairwise errors from a melted matrix that looks like this:

pw.data = data.frame(true_tree = rep(c("maple","oak","pine"),3), 
                 guess_tree = c(rep("maple",3),rep("oak",3),rep("pine",3)),
                 value = c(12,0,1,1,15,0,2,1,14))


true_tree guess_tree value
  maple      maple    12
    oak      maple     0
   pine      maple     1
  maple        oak     1
    oak        oak    15
   pine        oak     0
  maple       pine     2
    oak       pine     1
   pine       pine    14

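I want to estimate the pairwise error between the true tree species and the guessed tree species. The formula for this estimate should be: pairwise misassignments / all counts for the two selected species.

To explain more concretely: the wrong guesses for maple and oak (the maple-oak and oak-maple comparisons) are 1 + 0 = 1. The total number of guesses is 12 + 1 + 2 (all counts where true_tree == "maple") plus 0 + 15 + 1 (all counts where true_tree == "oak"), which is 31. So the estimate is 1/31.

For one specific case, say maple and oak again, I can compute it by hand: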

sum(pw.data[((pw.data[,1] == "maple" & pw.data[,2] == "oak") | 
      (pw.data[,1] == "oak" & pw.data[,2] == "maple")) &
      (pw.data[,1] != pw.data[,2]),3]) / 
 (sum(pw.data[pw.data[,1] == "maple",3]) + sum(pw.data[pw.data[,1] == "oak",3]))

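However, I want to do this for much larger data, so I would like to write a for loop or a function that computes the estimates and stores the results in a data frame, for example: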

Pw_tree   value
Maple-oak 0.0123
....

I tried to use that logic in the for loop below, but it does not work at all.

for (i in pw.data[,1]) { 
for (j in pw.data[,2]) {
x = sum( pw.data[((pw.data[,1] == i & pw.data[,2] == j ) | 
                (pw.data[,1] == j & pw.data[,2] == i)) &
               (pw.data[,1] != pw.data[,2]),3])  
y = (sum(pw.data[pw.data[,1] == i,3]) + sum(pw.data[pw.data[,1] == j,3]))
   PWerr_data = data.frame( pw_tree = paste(i,j, sep = "-"), value = x/y)
 }

}

It would be great if someone could point out what I am doing wrong. Many thanks!

1 Answer

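I usually solve this kind of problem by building the function I want to apply (you are almost there already), then building the data structure that is most convenient to apply it to, and then using one of the apply family of functions to iterate over that structure and collect the results. This lets me avoid for-loop constructs, which is nice because I am the kind of programmer who always messes up the indices in a double for loop.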
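That said, the loop in the question can be made to work. As far as I can tell it fails for three reasons: it iterates over every row of the two columns (nine values each, with repeats) rather than over the unique species names, it never skips self-pairs or mirrored pairs, and it overwrites PWerr_data on every iteration so only the last pair survives. Here is a minimal sketch of a repaired loop, assuming the same pw.data as in the question; the helper names trees, rows and mis are mine, not part of the original code.

```r
# Same data as in the question (guess_tree built with each = 3, same values)
pw.data <- data.frame(
  true_tree  = rep(c("maple", "oak", "pine"), 3),
  guess_tree = rep(c("maple", "oak", "pine"), each = 3),
  value      = c(12, 0, 1, 1, 15, 0, 2, 1, 14),
  stringsAsFactors = FALSE
)

trees <- unique(pw.data$true_tree)  # loop over unique names, not over rows
rows  <- list()                     # collect one result per pair

for (a in seq_along(trees)) {
  for (b in seq_along(trees)) {
    if (a >= b) next                # skip self-pairs and mirrored orderings
    i <- trees[a]
    j <- trees[b]
    # rows where one species of the pair was mistaken for the other
    mis <- (pw.data$true_tree == i & pw.data$guess_tree == j) |
           (pw.data$true_tree == j & pw.data$guess_tree == i)
    x <- sum(pw.data$value[mis])
    # all counts whose true species is either member of the pair
    y <- sum(pw.data$value[pw.data$true_tree %in% c(i, j)])
    # append instead of overwriting
    rows[[length(rows) + 1]] <- data.frame(pw_tree = paste(i, j, sep = "-"),
                                           value = x / y)
  }
}

PWerr_data <- do.call(rbind, rows)
PWerr_data
```

Growing a list and binding once at the end keeps the loop simple; calling rbind() on the data frame inside the loop would also work but gets slow on larger inputs.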

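In your case, we can wrap your ratio of sums in a function that takes two tree names and a data.frame as arguments. Then we only need to create the set of pairs we want to evaluate. A handy function here is combn(), which returns all combinations of size m from the elements of x: this gives us exactly the set of non-redundant pairs we want.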
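To make the shape of the combn() result concrete, here is a small sketch on a stand-in vector of species names (the variable names trees, pairs and pair_labels are only for illustration). Each column of the returned matrix is one unordered pair, which is why the later apply() call runs over columns.

```r
# Stand-in species vector, mirroring the names in the example data
trees <- c("maple", "oak", "pine")

# All unordered pairs: a 2 x 3 character matrix, one column per pair
pairs <- combn(trees, 2)
dim(pairs)

# Collapse each column into a label such as "maple-oak"
pair_labels <- apply(pairs, 2, paste, collapse = "-")
pair_labels
```

Because combn() never returns a pair in both orders and never pairs an element with itself, no extra filtering is needed.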

Annotated example code follows:

# Load your data
pw.data = data.frame(true_tree = rep(c("maple","oak","pine"),3), 
                     guess_tree = c(rep("maple",3),rep("oak",3),rep("pine",3)),
                     value = c(12,0,1,1,15,0,2,1,14))
pw.data
#>   true_tree guess_tree value
#> 1     maple      maple    12
#> 2       oak      maple     0
#> 3      pine      maple     1
#> 4     maple        oak     1
#> 5       oak        oak    15
#> 6      pine        oak     0
#> 7     maple       pine     2
#> 8       oak       pine     1
#> 9      pine       pine    14

# build the function we will repeatedly apply
getErr <- function(t1, t2, data=pw.data) {
  # compute the rate as you wrote it
  rate <- sum(data[((data[,1] == t1 & data[,2] == t2) | 
               (data[,1] == t2 & data[,2] == t1)) &
              (data[,1] != data[,2]),3]) / 
  (sum(data[data[,1] == t1,3]) + sum(data[data[,1] == t2,3]))

  # output the items involved as a named list (useful for later)
  list(Pw_tree = paste(t1, t2, sep='-'), error_rate = rate)
  }

# test it
getErr("maple", "oak")
#> $Pw_tree
#> [1] "maple-oak"
#> 
#> $error_rate
#> [1] 0.03225806
# Good, this matches the value you computed by hand (1/31)

# build the data structure we will run the function across
all.trees <- unique(c(as.character(pw.data$true_tree), as.character(pw.data$guess_tree)))
all.name.combos <- combn(all.trees, 2)

# we will use the do.call(rbind, ls) trick, where we generate a list
# with the apply function (one element per column, i.e. per pair)
# and bind it into a matrix with one row per pair
error_rates_df <- do.call(rbind, apply(all.name.combos, 2, function(pair){getErr(pair[1], pair[2])}))
error_rates_df
#>      Pw_tree      error_rate
#> [1,] "maple-oak"  0.03225806
#> [2,] "maple-pine" 0.1       
#> [3,] "oak-pine"   0.03225806
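One caveat: despite its name, error_rates_df is a matrix whose cells are list elements, not a data.frame, because rbind() is binding plain lists. If a real data frame is needed, one option is to turn each per-pair result into a one-row data frame before binding. Here is a minimal sketch using a hand-written stand-in for the getErr() results (the results list and the error_rates name are assumptions for illustration; the values are the three rates shown above).

```r
# Stand-in for the per-pair output: a list of named lists, shaped like getErr() results
results <- list(
  list(Pw_tree = "maple-oak",  error_rate = 1 / 31),
  list(Pw_tree = "maple-pine", error_rate = 3 / 30),
  list(Pw_tree = "oak-pine",   error_rate = 1 / 31)
)

# Binding the raw lists gives a list-matrix, not a data frame
m <- do.call(rbind, results)
is.data.frame(m)

# Convert each result to a one-row data frame first, then bind
error_rates <- do.call(
  rbind,
  lapply(results, as.data.frame, stringsAsFactors = FALSE)
)
str(error_rates)
```

With this form the error_rate column is an ordinary numeric vector, so sorting or filtering by rate works without unlisting.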

Created on 2018-10-30 by the reprex package (v0.2.1)

Answered 2018-10-30T12:00:02.177