r - Applying a custom function to a data.table by row returns an incorrect number of values

Question

I'm fairly new to data.table, and I have a table of DNA genomic coordinates that looks like this:

       chrom   pause strand coverage
    1:     1 3025794      +        1
    2:     1 3102057      +        2
    3:     1 3102058      +        2
    4:     1 3102078      +        1
    5:     1 3108840      -        1
    6:     1 3133041      +        1

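I wrote a custom function that I want to apply to each row of my table of roughly 2 million rows. It uses mapToTranscripts from GenomicFeatures to retrieve two related values: a transcript ID as a string and a new coordinate. I want to add them to my table as two new columns, like this: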

       chrom   pause strand coverage       transcriptID CDS
    1:     1 3025794      +        1 ENSMUST00000116652 196
    2:     1 3102057      +        2 ENSMUST00000116652  35
    3:     1 3102058      +        2 ENSMUST00000156816 888
    4:     1 3102078      +        1 ENSMUST00000156816 883
    5:     1 3108840      -        1 ENSMUST00000156816 882
    6:     1 3133041      +        1 ENSMUST00000156816 880

The function is as follows:

    get_feature <- function(dt){

      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
      hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
      tx_id <- tx_names[as.character(seqnames(hit))] 
      cds_coordinate <- sapply(ranges(hit), '[[', 1)

      if(length(tx_id) == 0 || length(cds_coordinate) == 0) {  
        out <- list('NaN', 0)
      } else {
        out <- list(tx_id, cds_coordinate)
      }

      return(out)
    }

Then I call it like this:

    counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]

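I get the following warnings, which indicate that the function returns two lists shorter than the original table instead of one new element per row:

    Warning messages: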
    1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
    2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

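I assumed that using .I in by would apply the function row by row and return one value per row. I also used the if statement to make sure the function never returns an empty value.

I then tried a mock version of the function: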

    get_feature <- function(dt) {

      return('I should be returned once for each row')

    }

and called it like this:

    new.table <- counts[, get_feature(.SD), by = .I]

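It produces a 1-row data.table instead of one with the original length. So I concluded that my function, or perhaps the way I am calling it, is somehow collapsing the elements of the resulting vector. What am I doing wrong?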

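Update (with solution): As @StatLearner pointed out, and as explained in this answer and in ?data.table, .I is only intended for use in j (as in DT[i, j, by=]). Therefore by=.I is equivalent to by=NULL, and the correct syntax is by=1:nrow(dt), which groups by row number and applies the function row by row.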
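A minimal sketch of that corrected call, combined with `:=` to add the two columns in place. The table values and the helper `toy_feature()` are made up for illustration; `toy_feature()` is only a placeholder for `get_feature()` and does not use GenomicFeatures.

```r
library(data.table)

# Toy stand-in for the real table; the values are illustrative only.
counts <- data.table(chrom    = c(1L, 1L, 1L),
                     pause    = c(3025794L, 3102057L, 3108840L),
                     strand   = c("+", "+", "-"),
                     coverage = c(1L, 2L, 1L))

# Hypothetical placeholder for get_feature(): for a one-row input it
# returns one character value and one integer value.
toy_feature <- function(dt) {
  list(paste0("tx_", dt$strand), dt$pause %% 1000L)
}

# Grouping by an explicit row index makes .SD a single row per call,
# so each call supplies exactly one value per new column.
counts[, c("transcriptID", "CDS") := toy_feature(.SD),
       by = seq_len(nrow(counts))]

print(counts)
```

`seq_len(nrow(counts))` is used here instead of `1:nrow(counts)` only because it also behaves sensibly for an empty table; for a non-empty table the two produce the same grouping.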

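Unfortunately, for my particular case this is far too inefficient: I measured an execution time of 20 seconds per 100 rows. For my 36-million-row dataset, that would take about 3 months to finish.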
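The cost comes from calling the function once per group rather than once for the whole table. A small sketch of that difference, using a made-up table `toy` and a hypothetical cheap function `shift_pause()` in place of the expensive lookup; the counter `n_calls` exists only to make the number of calls visible.

```r
library(data.table)

n <- 2000L
toy <- data.table(pause    = seq_len(n),
                  coverage = rep(1:2, length.out = n))

# Hypothetical vectorised stand-in for the expensive function.
# It bumps a counter so the number of invocations can be inspected.
n_calls <- 0L
shift_pause <- function(dt) {
  n_calls <<- n_calls + 1L
  list(shifted = dt$pause + dt$coverage)
}

# Row-wise: the function is invoked for every single-row group.
row_wise <- toy[, shift_pause(.SD), by = seq_len(nrow(toy))]
calls_row_wise <- n_calls

# Whole table: the function sees all rows at once.
n_calls <- 0L
whole_table <- toy[, shift_pause(.SD)]
calls_whole <- n_calls

# The values agree; only the number of function calls differs.
stopifnot(identical(row_wise$shifted, whole_table$shifted))
print(c(row_wise = calls_row_wise, whole_table = calls_whole))
```

When the function itself is vectorised, as in this toy case, the row-wise form only adds per-call overhead without changing the result.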

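In the end I had to give up on the row-wise approach and call mapToTranscripts on the whole table like this, which takes a few seconds and is evidently the intended use: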

    get_features <- function(dt){
      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate
      hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript
      tx_hit <- as.character(seqnames(hits)) # get transcript number
      tx_id <- tx_names[tx_hit] # get transcript name from translation table

      # one row per hit, so the result can be shorter than the input table
      return(data.table('transcriptID' = tx_id,
                        'CDS_coordinate' = start(hits)))
    }

    density <- counts[, get_features(.SD)]

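I then mapped back to the genome using mapFromTranscripts from the GenomicFeatures package, so that I could use data.table joins to retrieve information from the original table, which is what I was trying to do in the first place.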
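A different way to line the hits up with the original rows, sketched here as an alternative and not as what was actually run: the result of mapToTranscripts carries an `xHits` metadata column giving the index of the input range behind each hit, and that index can drive an update join. The `hits` table below is made up and stands in for a data.table built from the real result (with an assumed extra column such as `xHits = mcols(hits)$xHits`); `row_id` is a hypothetical helper column.

```r
library(data.table)

# Toy version of the original table; the values are illustrative only.
counts <- data.table(chrom    = c(1L, 1L, 1L, 1L),
                     pause    = c(3025794L, 3102057L, 3102058L, 3102078L),
                     strand   = c("+", "+", "+", "+"),
                     coverage = c(1L, 2L, 2L, 1L))

# Made-up stand-in for the mapped hits: one row per hit, where xHits is
# the index of the row in counts that produced it. Row 2 has no hit.
hits <- data.table(xHits        = c(1L, 3L, 4L),
                   transcriptID = c("ENSMUST00000116652",
                                    "ENSMUST00000156816",
                                    "ENSMUST00000156816"),
                   CDS          = c(196L, 888L, 883L))

# Give every original row an explicit index, then fill the new columns
# with an update join; rows without a hit are left as NA.
counts[, row_id := .I]
counts[hits, on = .(row_id = xHits),
       `:=`(transcriptID = i.transcriptID, CDS = i.CDS)]

print(counts)
```

This also shows why the whole-table call returns fewer values than there are rows: positions without a hit contribute nothing to the result. A position that maps to several transcripts would need extra handling before a join like this one.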

score 4 · Accepted Answer

When I need to apply a function to each row of a data.table, I group it by row number:

    counts[, get_feature(.SD), by = 1:nrow(counts)]

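As explained in this answer, .I is not intended for use in by, because it is supposed to return the sequence of row indices produced by the grouping. The reason by = .I does not throw an error is that data.table creates an object .I equal to NULL in the data.table namespace, so by = .I is equivalent to by = NULL.

Note that by = 1:nrow(dt) groups by row number, so your function only has access to a single row of the data.table at a time: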

    library(data.table)
    counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
                         pause = sample((3 * 10^6):(3.2 * 10^6), size = 100),
                         strand = sample(c('-','+'), size = 100, replace = TRUE),
                         coverage = sample.int(3, size = 100, replace = TRUE))

    get_feature <- function(dt){
        coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
        rowNum <- nrow(coordinate)
        return(list(text = 'Number of rows in dt', rowNum = rowNum))
    }

    counts[, get_feature(.SD), by = 1:nrow(counts)]

This produces a data.table with the same number of rows as counts, but within each call coordinate contains only a single row of counts (first rows shown):

       nrow                 text rowNum
    1:    1 Number of rows in dt      1
    2:    2 Number of rows in dt      1
    3:    3 Number of rows in dt      1
    4:    4 Number of rows in dt      1
    5:    5 Number of rows in dt      1

whereas by = NULL passes the entire data.table to the function:

    counts[, get_feature(.SD), by = NULL]

                       text rowNum
    1: Number of rows in dt    100

This is how by is intended to work.
