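For a data.table (or data.frame) in R, I want to find all rows whose value in the "value" column is a given distance, "distance", away from the value of another row with the same key. So, given the following: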

distance <- 22
   key value
   A     1
   B     1
   C     1
   D     1
   A     4
   B     4
   A    23
   B    23
   B    26
   B    26
   C    30

I would like to annotate the original table with a count of how many rows exist with the same key and a value exactly +22 from it:

  key value count
  A     1     1
  B     1     1
  C     1     0
  D     1     0
  A     4     0
  B     4     2
  A    23     0
  B    23     0
  B    26     0
  B    26     0
  C    30     0

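I really don't know where to start with this kind of self-referential data manipulation in R. My first attempts involved creating a second table and trying to match against it, but that seems like an odd and poor approach.

Note: I'm using the data.table package, but I'm happy to work with a data.frame in this case if that makes things easier.

Reproducible: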

require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))

2 Answers

Here is a data.table-based solution. I would be interested to learn what improvements, if any, could be made to it.

# Your code
library(data.table)
source <- 
data.table(data.frame(key = c("A","B","C","D","A","B","A","B","B","B", "C"),
                      value = c(1,1,1,1,4,4,23,23,26,26,30)))

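The data.table(data.frame(... construction is odd, because data.table() also has an argument named key. Wrapping a data.frame is one way to create a data.table with a column named "key". Capitalizing the column names to avoid the clash with the argument name illustrates the more standard syntax: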

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))

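Next, to avoid needing as.integer() later, we change the type of the Value column from numeric to integer now. Remember that 1 in R is numeric; it is 1L that is integer. Storing data as integer is usually more efficient than storing it as numeric. The next line is easier than typing lots of Ls above.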

source[,Value:=as.integer(Value)]   # change type from `numeric` to `integer`
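
As a small illustration of the numeric versus integer point, here is a sketch in base R. The names x and y are only for this example and are not part of the solution.

```r
x <- c(1, 4, 23)          # plain literals are stored as double ("numeric")
class(x)                  # "numeric"
class(c(1L, 4L, 23L))     # "integer"

y <- as.integer(x)        # same values, integer storage
is.integer(y)             # TRUE
is.integer(y + 22L)       # TRUE: integer + integer stays integer
is.integer(y + 22)        # FALSE: adding a numeric literal promotes to double
```

This is why distance is written as 22L further down: Value + distance then stays integer and has the same type as the Value column it is matched against.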

Now, continuing:

distance <- 22L
setkey(source, Key, Value)

# Heart of the solution (following a few explanatory comments):
#  "J()"   : shorthand for 'data.table()'
#  ".N"    : returns the number of rows that matched a line (see ?data.table)
#  "[[3]]" : as with simple data.frames, extracts the vector in column 3

source[,count:=source[J(Key,Value+distance),.N][[3]]]
source
      Key Value count
 [1,]   A     1     1
 [2,]   A     4     0
 [3,]   A    23     0
 [4,]   B     1     1
 [5,]   B     4     2
 [6,]   B    23     0
 [7,]   B    26     0
 [8,]   B    26     0
 [9,]   C     1     0
[10,]   C    30     0
[11,]   D     1     0

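Note that := changed source directly by reference, so that is all there is to it. However, setkey() also changed the order of the original data. If the original order needs to be preserved, then: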

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[,Value:=as.integer(Value)]   
source[,count:=setkey(copy(source))[source[,list(Key,Value+distance)],.N][[3]]]

      Key Value count
 [1,]   A     1     1
 [2,]   B     1     1
 [3,]   C     1     0
 [4,]   D     1     0
 [5,]   A     4     0
 [6,]   B     4     2
 [7,]   A    23     0
 [8,]   B    23     0
 [9,]   B    26     0
[10,]   B    26     0
[11,]   C    30     0
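
The lines above rely on a join implicitly grouping by each row of i. As far as I know, newer data.table releases (1.9.4 onward) need by = .EACHI for that, and they also allow an ad hoc join through the on argument, so no setkey() or copy() is required. A minimal sketch under those assumptions follows; the names dt, probe and counts are chosen for this example only.

```r
library(data.table)

dt <- data.table(Key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 Value = as.integer(c(1,1,1,1,4,4,23,23,26,26,30)))
distance <- 22L

# Rows to look up: same Key, Value shifted by `distance`
probe <- dt[, .(Key, Value = Value + distance)]

# by = .EACHI yields one result row per row of `probe`;
# .N is the number of rows of dt matching that probe row (0 if none)
counts <- dt[probe, on = .(Key, Value), .N, by = .EACHI]

# dt is never keyed, so its original row order is untouched
dt[, count := counts$N]
dt
```

Because the result has one row per row of probe, in the same order, counts$N lines up with the rows of dt directly.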
answered 2012-05-23T19:29:20.883

You can use mapply to loop over every key and value pair:

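# Count rows with the same key whose value is exactly val + distance.
# Note: c() mixes character and numeric, so every resulting column is character.
data.table(t(mapply(function(key,val) 
      c(key=key,value=val,count=length(source$value[source$key==key & source$value==(val+distance)]) )
   , as.character(source$key),source$value)))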
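
If keeping the column types matters, a variant of the same idea is to let mapply return only the counts and attach them as a new column. This is a sketch in base R on a plain data.frame; df is a name used only in this example.

```r
df <- data.frame(key   = c("A","B","C","D","A","B","A","B","B","B","C"),
                 value = c(1,1,1,1,4,4,23,23,26,26,30),
                 stringsAsFactors = FALSE)
distance <- 22

# For each row, count the rows with the same key and a value exactly `distance` higher
df$count <- mapply(
  function(k, v) sum(df$key == k & df$value == v + distance),
  df$key, df$value,
  USE.NAMES = FALSE
)
df
```

Each row is compared against the whole table, so the work grows quadratically with the number of rows. That is fine for small data, but the keyed join is likely to scale better.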
answered 2012-05-23T18:31:09.503