
I have a data.table which I want to split into two. I do this as follows:

dt <- data.table(a=c(1,2,3,3),b=c(1,1,2,2))
sdt <- split(dt,dt$b==2)

but if I then want to do something like this as the next step

sdt[[1]][,c:=.N,by=a]

I get the following warning message.

Warning message: In [.data.table(sdt[[1]], , :=(c, .N), by = a) : Invalid .internal.selfref detected and fixed by taking a copy of the whole table, so that := can add this new column by reference. At an earlier point, this data.table has been copied by R. Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects), use reflist() instead if needed (to be implemented). If this message doesn't help, please report to datatable-help so the root cause can be fixed.

Just wondering if there is a better way of splitting the table so that it would be more efficient (and would not get this message)?


3 Answers


This works in v1.8.7 (and may also work in v1.8.6):

> sdt = lapply(split(1:nrow(dt), dt$b==2), function(x)dt[x])
> sdt
$`FALSE`
   a b
1: 1 1
2: 2 1

$`TRUE`
   a b
1: 3 2
2: 3 2

> sdt[[1]][,c:=.N,by=a]     # now no warning
> sdt
$`FALSE`
   a b c
1: 1 1 1
2: 2 1 1

$`TRUE`
   a b
1: 3 2
2: 3 2

However, as @mnel said, this is inefficient. Avoid splitting if possible.
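
As a minimal sketch of what avoiding the split could look like here: filter in i and group with by, so the new column is added by reference to dt itself. This assumes the goal is the per-a row count only for the rows where b is not 2, as in the question; rows outside the filter are left as NA.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# Count rows per value of a, but only among rows where b != 2.
# := adds column c to dt by reference, so no split and no copy is needed.
dt[b != 2, c := .N, by = a]

print(dt)
```

Since dt is never copied through a plain R list here, the .internal.selfref warning from the question should not come up.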

answered 2013-02-20T11:25:09.003

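I was looking for a way to split a data.table and came across this old question.

Sometimes splitting really is what you want, and the data.table "by" approach is not convenient.

In fact, you can easily do the split by hand using only data.table instructions, and it works very efficiently: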

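# Split dt into a list of data.tables, one per run of equal values in column attr.
# Rows are cut wherever consecutive values differ, so sort dt by attr first
# if each distinct value should end up in a single piece.
SplitDataTable <- function(dt, attr) {
  x <- dt[[attr]]
  # Last row index of each run, with 0 prepended as a sentinel
  boundaries <- c(0, which(head(x, -1) != tail(x, -1)), nrow(dt))
  mapply(
    function(start, end) dt[start:end],
    head(boundaries, -1) + 1,  # first row of each run
    tail(boundaries, -1),      # last row of each run
    SIMPLIFY = FALSE
  )
}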
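
A short usage sketch for the function above. The boundary logic cuts wherever two consecutive values differ, so this example sorts by the split column first with setorderv(). The name split_runs is hypothetical and only repeats the same logic so that the snippet runs on its own.

```r
library(data.table)

# Hypothetical stand-in for SplitDataTable, repeated so this snippet is self-contained.
split_runs <- function(dt, attr) {
  x <- dt[[attr]]
  boundaries <- c(0, which(head(x, -1) != tail(x, -1)), nrow(dt))
  mapply(function(start, end) dt[start:end],
         head(boundaries, -1) + 1, tail(boundaries, -1),
         SIMPLIFY = FALSE)
}

dt <- data.table(a = c(3, 1, 3, 2), b = c(2, 1, 2, 1))
setorderv(dt, "b")             # make equal values of b contiguous (reorders dt by reference)
parts <- split_runs(dt, "b")   # list with one data.table per run of b

print(parts)
```

Without the sort, the unsorted b column above would be cut into four single-row pieces instead of two.
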
answered 2015-07-06T12:52:12.837

As mentioned above (@jangorecki), the data.table package already has its own split function. In that simplified case we can use:

> dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))
> split(dt, by = "b")
$`1`
   a b
1: 1 1
2: 2 1

$`2`
   a b
1: 3 2
2: 3 2

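For more difficult or specific cases, I suggest creating a new variable in the data.table with the by-reference functions := or set, and then calling split on it. If you care about performance, make sure to stay within the data.table environment throughout, e.g. dt[, SplitCriteria := (...)], rather than computing the split variable externally.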
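
A minimal sketch of that suggestion, applied to the table from the question. The condition b == 2 is only an illustrative choice for the SplitCriteria column, and the by and keep.by arguments assume a data.table version whose split method supports them (1.9.8 or later, as far as I know).

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# Compute the split variable inside dt, by reference
dt[, SplitCriteria := b == 2]

# Split on that column; keep.by = FALSE drops it from the resulting pieces
sdt <- split(dt, by = "SplitCriteria", keep.by = FALSE)

print(names(sdt))
```

The helper column stays in dt after the split; dt[, SplitCriteria := NULL] removes it again if it is no longer needed.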

answered 2019-08-30T13:13:24.770