r - Finding ranges in runs of numbers

Question

I'm trying to find runs of consecutive years in a data frame (preferably using plyr).

I want to go from this:

require(plyr)

dat<-data.frame(
  name=c(rep("A", 11), rep("B", 11)),
  year=c(2000:2010, 2000:2005, 2007:2011)
  )

to this:

out<-data.frame(
  name=c("A", "B", "B"),
  range=c("2000-2010", "2000-2005", "2007-2011"))

It's easy to determine whether each group's years are continuous:

ddply(dat, .(name), summarise,
      continuous=(max(year)-min(year))+1==length(year))

How can I break group "B" into two ranges?

Any ideas or strategies would be much appreciated.

Thanks

score 7 · Accepted Answer

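Whether you use "plyr" or functions from base R, you first need to establish some groups. Since your years are sequential, one way to detect a change of group is to look for the positions where diff is not equal to 1. diff returns a vector one element shorter than its input, so we prepend a 1 to it and take the cumsum of the result.

Putting that mouthful of an explanation into practice, you can try something like this: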

dat$id2 <- cumsum(c(1, diff(dat$year) != 1))
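To see why this works, here is a minimal sketch that walks through the intermediate values on a small, made-up vector (the names years, steps, breaks and run_id are only for illustration and are not part of the original answer):

```r
# Hypothetical example data: a gap between 2002 and 2005
years <- c(2000, 2001, 2002, 2005, 2006)

steps  <- diff(years)        # 1 1 3 1   (one element shorter than `years`)
breaks <- c(1, steps != 1)   # 1 0 0 1 0 (the leading 1 starts the first run)
run_id <- cumsum(breaks)     # 1 1 1 2 2 (the counter goes up at every break)

# Each run now has its own id, so the years can be split by it
split(years, run_id)
```

Every element where the step from the previous year is not 1 bumps the counter, so all years in the same run share one id.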

From here, you can use aggregate or your favorite grouping function to get the output you're looking for.

aggregate(year ~ name + id2, dat, function(x) paste(min(x), max(x), sep = "-"))
#   name id2      year
# 1    A   1 2000-2010
# 2    B   2 2000-2005
# 3    B   3 2007-2011
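Since the question asks for plyr, the same grouping can probably be done with ddply as well. This is a sketch assuming plyr is installed; the result name res and the output column name range are choices made here, not part of the original answer:

```r
library(plyr)

dat <- data.frame(
  name = c(rep("A", 11), rep("B", 11)),
  year = c(2000:2010, 2000:2005, 2007:2011)
)

# Same run id as above: the counter increases whenever the step is not 1
dat$id2 <- cumsum(c(1, diff(dat$year) != 1))

# One row per (name, id2) combination, with the run formatted as "min-max"
res <- ddply(dat, .(name, id2), summarise,
             range = paste(min(year), max(year), sep = "-"))
res
```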

To use range with aggregate, you need to change sep to collapse, as follows:

aggregate(year ~ name + id2, dat, function(x) paste(range(x), collapse = "-"))
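One caveat: computing diff over the whole year column only separates names because the year happens to jump from 2010 back to 2000 where "A" ends and "B" begins. If one group ended in 2010 and the next started in 2011, the two would be merged into a single run. A possible variation, sketched here with base R's ave (the column name run is an assumption for illustration), numbers the runs within each name and then reshapes the result into the layout requested in the question:

```r
dat <- data.frame(
  name = c(rep("A", 11), rep("B", 11)),
  year = c(2000:2010, 2000:2005, 2007:2011)
)

# Number the runs separately within each name, so that a new name
# always starts a new run regardless of the year values
dat$run <- ave(dat$year, dat$name,
               FUN = function(y) cumsum(c(1, diff(y) != 1)))

res <- aggregate(year ~ name + run, dat,
                 function(x) paste(range(x), collapse = "-"))
res <- res[order(res$name, res$run), ]

# Keep only the columns asked for in the question
out <- data.frame(name = res$name, range = res$year)
out
```

Because run restarts at 1 for every name, it has to be used together with name as the grouping key.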

score 2 · Answer

Tooting my own horn, cgwtools::seqle can be used to identify the splits. Run a loop or *apply over the names elements, and for each case,

foo <- seqle(dat$year, incr=1)

Then length(foo$lengths) will give you the number of groups, and the range of years is easily reconstructed from foo$values and foo$lengths.

yeargroups <- sapply(seq_along(foo$lengths), function(x) c(foo$values[x], foo$values[x] + foo$lengths[x] - 1))

Just proposing this in case someone has a similar situation with different parameters or desired subdivisions.
