How can I summarize consecutive depth data in R? For example:

a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood")), 
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))

The desired output is:

Label Depth
Air    7
Wood   3
Stone  1

First, remove the decreasing values with cummax(), since in this particular case depth can only increase. This gives:

   label depth
1    Air     1
2    Air     2
3    Air     3
4    Air     3
5    Air     4
6    Air     5
7   Wood     5
8   Wood     5
9   Wood     5
10  Wood     6
11  Wood     8
12   Air     9
13   Air     9
14   Air     9
15   Air    10
16 Stone    10
17 Stone    10
18 Stone    11
19 Stone    11
20   Air    11
21   Air    12
22   Air    12
23   Air    12
24   Air    13
25  Wood    14
26  Wood    14

Now take max - min within each run of consecutive rows to get the depth gain per run (this is the step I don't know how to do):

   label depth
1   Air     4
2   Wood    3
3   Air     1
4   Stone   1
5   Air     2
6   Wood    0

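Finally, sum these max - min values per label, which gives the output shown above.

What I have tried so far:

The first obvious attempt, shown here for Air, is: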

diff(cummax(a[a$label=="Air",]$depth))

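This removes the negative values, which is necessary because depth is expected to keep increasing. The problem is that the output also counts the large jumps between consecutive subsets, so the sum for Air comes out as 12 instead of 7.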

 [1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1

Even worse is an attempt using aggregate, for example:

aggregate(depth~label, a, FUN=function(x){sum(x>0)})

Note: a solution that filters out large jumps is not what I am looking for. Of course, you could hard-code a limit for the Air example, e.g. < 2:

sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])

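This gives an almost correct result (8 instead of 7), but it is not a proper solution. I'm fairly sure the function I'm looking for already exists, since this is not an uncommon problem across many different tasks.

I think taking the minimum and maximum of each run of consecutive rows for each material and summing the differences would be one possible solution, but I'm not sure how to apply a function only to consecutive subsets.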

3 Answers

You can use data.table::rleid to group runs quickly, or rle if you really want to rebuild it yourself. After that, the aggregation is fairly easy in any syntax. In dplyr,

library(dplyr)

a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood","Air","Air","Air","Air","Stone","Stone","Stone","Stone","Air","Air","Air","Air","Air","Wood","Wood"), 
                depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))

a2 <- a %>% 
    # filter to rows where previous value is lower, equal, or NA
    filter(depth >= lag(depth) | is.na(lag(depth))) %>% 
    # group by label and its run
    group_by(label, run = data.table::rleid(label)) %>% 
    summarise(depth = max(depth) - min(depth))    # aggregate

a2 %>% arrange(run)    # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups:   label [3]
#>    label   run depth
#>   <fctr> <int> <dbl>
#> 1    Air     1     4
#> 2   Wood     2     3
#> 3    Air     3     1
#> 4  Stone     4     1
#> 5    Air     5     2
#> 6   Wood     6     0

a3 <- a2 %>% summarise(depth = sum(depth))    # a2 is still grouped, so aggregate more

a3
#> # A tibble: 3 x 2
#>    label depth
#>   <fctr> <dbl>
#> 1    Air     7
#> 2  Stone     1
#> 3   Wood     3
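
As a rough sketch of the `rle` route mentioned in this answer (not the answerer's code): the run ids that `data.table::rleid` would give can be rebuilt in base R by repeating each run's index by its length. The names `runs`, `run`, `depth_cm`, `gain`, and `totals` are made up for illustration, and the sketch uses the `cummax()` correction from the question rather than the `lag()` filter above.

```r
# Same example data as in the question.
a <- data.frame(
  label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood",
            "Air","Air","Air","Air","Stone","Stone","Stone","Stone",
            "Air","Air","Air","Air","Air","Wood","Wood"),
  depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)
)

# rle() gives the label and length of every run of identical labels.
# as.character() keeps this working for both factor and character labels.
runs <- rle(as.character(a$label))

# Run id: run i is repeated lengths[i] times (1,1,1,1,1,1,2,2,...).
a$run <- rep(seq_along(runs$lengths), runs$lengths)

# Enforce non-decreasing depth, as in the question (assumption of this sketch).
a$depth_cm <- cummax(a$depth)

# Depth gain inside each run, in run order.
gain <- as.vector(tapply(a$depth_cm, a$run, function(x) max(x) - min(x)))

# Sum the per-run gains by the label of each run.
totals <- tapply(gain, runs$values, sum)
totals
```

On the example data this should give 7 for Air, 1 for Stone, and 3 for Wood, matching the output requested in the question.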
answered 2017-06-25T17:51:54.757

A base R approach using aggregate:

aggregate(cbind(val=cummax(a$depth)),
          list(label=a$label, ID=c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))

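The first argument to aggregate computes the cumulative maximum of the input vector, as the OP did above; wrapping it in cbind gives the computed column its name (val) in the final output. The second argument is the grouping argument. It takes a different approach from rle: it computes the cumulative sum of an indicator that is TRUE wherever the label differs from the previous row, which yields one ID per run. Finally, the third argument supplies the function that computes the desired output by taking the difference of each group's range, that is, max minus min.

This returns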

  label ID val
1   Air  0   4
2  Wood  1   3
3   Air  2   1
4 Stone  3   1
5   Air  4   2
6  Wood  5   0
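
The result above is still per run. To get the per-label totals asked for in the question, one option is a second `aggregate` call over that result. This is only a sketch that extends the answer's code; `run_id`, `per_run`, and `totals` are illustrative names, and `label` is assumed to be a factor, as in the question, so that `as.integer()` works on it.

```r
# Example data from the question, with label as a factor.
a <- data.frame(
  label = factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood",
                   "Air","Air","Air","Air","Stone","Stone","Stone","Stone",
                   "Air","Air","Air","Air","Air","Wood","Wood")),
  depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)
)

# Run id as in the answer: increases by one whenever the label changes.
run_id <- c(0, cumsum(diff(as.integer(a$label)) != 0))

# Per-run depth gain (same call as in the answer).
per_run <- aggregate(cbind(val = cummax(a$depth)),
                     list(label = a$label, ID = run_id),
                     function(x) diff(range(x)))

# Second pass: sum the per-run gains for each label.
totals <- aggregate(val ~ label, data = per_run, FUN = sum)
totals
```

For the example data, `totals` should contain 7 for Air, 1 for Stone, and 3 for Wood.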
answered 2017-06-25T19:41:49.953

The data.table way (partly borrowed from @alistaire):

library(data.table)

setDT(a)                      # convert a to a data.table by reference
a[, depth := cummax(depth)]   # enforce non-decreasing depth
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1],  # Only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)
]
result <- depth_gain[, list(depth = sum(depth)), by = label]
answered 2017-06-27T17:38:26.967