r - data.table 中的 .internal.selfref 无效

Question

我需要分配一个“第二个” id 来对我原来id的 . 这是我的样本数据：

dt<-structure(list(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                   period = c("start", "end", "start", "end", "start", "end"),
                   date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))),
              class = c("data.table", "data.frame"),
              .Names = c("id", "period", "date"),
              sorted = "id")
> dt
     id period       date
1: aaaa  start 2012-03-02
2: aaaa    end 2012-03-05
3: aaas  start 2012-08-21
4: aaas    end 2013-02-25
5: bbbb  start 2012-03-31
6: bbbb    end 2013-02-11

需要根据此列表对列id进行分组（在 say 中使用相同的值）：id2

> groups
[[1]]
[1] "aaaa" "aaas"

[[2]]
[1] "bbbb"

我使用了以下代码，它似乎可以通过以下方式工作warning：

    > dt[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    Warning message:
    In `[.data.table`(dt, , `:=`(id2, which(vapply(groups, function(x,  :
      Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table has
been copied by R (or been created manually using structure() or similar). Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use
set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also,
list (DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
    > dt
         id period       date id2
    1: aaaa  start 2012-03-02   1
    2: aaaa    end 2012-03-02   1
    3: aaas  start 2012-08-29   1
    4: aaas    end 2013-02-26   1
    5: bbbb  start 2012-03-31   2
    6: bbbb    end 2013-02-11   2

有人可以简要解释此警告的性质以及对最终结果的任何最终影响（如果有的话）吗？谢谢

编辑：

以下代码实际上显示了何时dt创建以及如何传递给发出警告的函数：

f.main <- function(){
      f2 <- function(x){
      groups <- list(c("aaaa", "aaas"), "bbbb") # actually generated depending on the similarity between values of x$id
      x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
      return(x)
  }
  x <- f1()
  if(!is.null(x[["res"]])){
    x <- f2(x[["res"]])
    return(x)
  } else {
    # something else
  }
}

f1 <- function(){
  dt<-data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date")))
  return(list(res=dt, other_results=""))
}

> f.main()
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
Warning message:
In `[.data.table`(x, , `:=`(id2, which(vapply(groups, function(x,  :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table
has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr().
Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.

score 12 · Accepted Answer

是的，问题是列表。这是一个简单的例子：

DT <- data.table(1:5)
mylist1 <- list(DT,"a")
mylist1[[1]][,id:=.I]
#warning

mylist2 <- list(data.table(1:5),"a")
mylist2[[1]][,id:=.I]
#no warning

您应该避免将 data.table 复制到列表中（并且为了安全起见，我将完全避免在列表中包含 DT）。尝试这个：

f1 <- function(){
  mylist <- list(res=data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))))
  other_results <- ""
  mylist$other_results <- other_results
  mylist
}

score 12 · Accepted Answer

您可以在创建列表时“浅复制”，这样 1）您不会进行完整的内存复制（速度不受影响）和 2）您不会收到内部 ref 错误（感谢@mnel 这个技巧） .

创建数据：

set.seed(45)
ss <- function() {
    tt <- sample(1:10, 1e6, replace=TRUE)
}
tt <- replicate(100, ss(), simplify=FALSE)
tt <- as.data.table(tt)

您应该如何创建列表（浅拷贝）：

system.time( {
    ll <- list(d1 = { # shallow copy here...
        data.table:::settruelength(tt, 0)
        invisible(alloc.col(tt))
    }, "a")
})
user  system elapsed
   0       0       0
> system.time(tt[, bla := 2])
   user  system elapsed
  0.012   0.000   0.013
> system.time(ll[[1]][, bla :=2 ])
   user  system elapsed
  0.008   0.000   0.010

因此，您不会在速度上妥协，也不会在收到完整副本后收到警告。希望这可以帮助。

score 6 · Accepted Answer

“通过复制检测并修复了无效的 .internal.selfref……”

在 f2() 中分配 id2 时无需复制，您可以通过更改直接添加列：

# From:

      x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]

# To something along the lines of:
      x$id2 <- findInterval( match( x$id, unlist(groups)), cumsum(c(0,sapply(groups, length)))+1)

然后，您可以像往常一样继续使用您的“x”data.table，而不会产生警告。

f2(x[["res"]])此外，要简单地抑制警告，您可以在调用周围使用 suppressWarnings() 。

即使在小表上也可能存在很大的性能差异：

Performance Comparison:
Unit: milliseconds
                       expr      min       lq   median       uq      max neval
                   f.main() 2.896716 2.982045 3.034334 3.137628 7.542367   100
 suppressWarnings(f.main()) 3.005142 3.081811 3.133137 3.210126 5.363575   100
            f.main.direct() 1.279303 1.384521 1.413713 1.486853 5.684363   100

r - data.table 中的 .internal.selfref 无效

3 回答 3

创建数据：

您应该如何创建列表（浅拷贝）：

Related

Reference