r - rbindlist two data.tables where one has a factor and the other has a character type for a column

Question

I just came across this warning in my script, and it struck me as a bit strange.

# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion

Observation 1: Here is a reproducible example:

require(data.table)
DT.1 <- data.table(x = letters[1:5], y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

# works fine
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15

However, if I now convert column x to a factor (ordered or not) and do the same:

DT.1[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#      x  y
#  1:  a  6
#  2:  b  7
#  3:  c  8
#  4:  d  9
#  5:  e 10
#  6: NA 11
#  7: NA 12
#  8: NA 13
#  9: NA 14
# 10: NA 15
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion

But rbind handles this just fine!

rbind(DT.1, DT.2) # where DT.1 has column x as factor
# do.call(rbind, list(DT.1, DT.2)) # also works fine
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: A 11
#  7: B 12
#  8: C 13
#  9: D 14
# 10: E 15

The same behaviour can be reproduced if column x is an ordered factor. Since the help page ?rbindlist says: Same as do.call("rbind",l), but much faster., I'm guessing this is not the desired behaviour?

Here is my session info:

# R version 3.0.0 (2013-04-03)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.8.8
# 
# loaded via a namespace (and not attached):
# [1] tools_3.0.0

Edit:

Observation 2: Following up on @AnandaMahto's comment, another interesting behaviour shows up when the order is reversed:

# column x in DT.1 is still a factor
rbindlist(list(DT.2, DT.1))
#     x  y
#  1: A 11
#  2: B 12
#  3: C 13
#  4: D 14
#  5: E 15
#  6: 1  6
#  7: 2  7
#  8: 3  8
#  9: 4  9
# 10: 5 10

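Here, column x from DT.1 is silently coerced to its numeric codes.
Edit: To clarify, this is the same behaviour as rbind(DT.2, DT.1) when column x of DT.1 is a factor. rbind seems to retain the class of the first argument. I'll leave this part here and note that, in this case, this is the desired behaviour, since rbindlist is meant to be a faster equivalent of rbind.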

Observation 3: If both columns are now converted to factors:

# DT.1 column x is already a factor
DT.2[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
#     x  y
#  1: a  6
#  2: b  7
#  3: c  8
#  4: d  9
#  5: e 10
#  6: a 11
#  7: b 12
#  8: c 13
#  9: d 14
# 10: e 15

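Here, column x from DT.2 is lost (replaced with that of DT.1). If the order is reversed, the exact opposite happens (column x of DT.1 is replaced with that of DT.2).

In general, rbindlist seems to have a problem handling factor columns.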

score 7 · Accepted Answer

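UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9

I believe that rbindlist, when applied to factors, combines the integer codes of the factors and uses only the levels associated with the first list element.

As described in this bug report: http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975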
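To make that suspicion concrete, here is a minimal base R sketch. It only imitates the behaviour I believe is happening (concatenating the integer codes and reusing the first factor's levels); it is an assumption for illustration, not the actual code inside rbindlist. The names f1, f2, naive and safe are made up for this example.

```r
# Two factors with different level sets, mirroring column x of DT.1 and DT.2
f1 <- factor(letters[1:5])
f2 <- factor(LETTERS[1:5])

# Suspected buggy behaviour (an assumption, not rbindlist's real internals):
# concatenate the integer codes and reuse only the levels of the first factor
codes <- c(as.integer(f1), as.integer(f2))
naive <- factor(levels(f1)[codes], levels = levels(f1))

# What rbind effectively does: combine the labels, then rebuild the factor
safe <- factor(c(as.character(f1), as.character(f2)))

as.character(naive)  # labels of f2 are replaced by those of f1
as.character(safe)   # labels of both factors are preserved
```

The naive result has the same shape as the output in Observation 3, which is why I think the levels of the later tables are simply being ignored.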

# Temporary workaround: 

# unique() guards against duplicated levels, which factor() rejects
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))

DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]

rbindlist(list(DT.1, DT.2))

Here is another way to see what is going on:

DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)

DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]

DT3
DT4

# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd

do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd

Edit as per the comments:

As for Observation 1, what is happening is similar to:

x <- factor(LETTERS[1:5])

x[6:10] <- letters[1:5]
x

# Notice however, if you are assigning a value that is already present
x[11] <- "S"  # warning, since `S` is not one of the levels of x
x[12] <- "D"  # all good, since `D` *is* one of the levels of x
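For comparison, a small sketch of how the same assignment behaves once the new labels have been registered as levels first. This is plain base R and only illustrates the factor mechanics; it says nothing about how rbindlist was eventually fixed.

```r
x <- factor(LETTERS[1:5])

# Register the new labels as levels first, so the assignment can map onto them
levels(x) <- c(levels(x), letters[1:5])
x[6:10] <- letters[1:5]   # no warning this time

x
anyNA(x)
```

This is essentially what the temporary workaround above does for the two tables: both factors are given the full set of levels before they are combined.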

score 2 · Accepted Answer

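rbindlist is super fast because it skips the checks that rbindfill or do.call(rbind.data.frame,...) perform.

You can use a workaround like this to make sure factors are coerced to character.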

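DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

# Collect the tables in a list; DDL is used below
DDL <- list(DT.1, DT.2)

for (ii in seq_along(DDL)) {
  # names of the factor columns in the ii-th table
  ff <- Filter(function(x) is.factor(DDL[[ii]][[x]]), names(DDL[[ii]]))
  for (fn in ff) {
    # replace each factor column with its character labels, by reference
    set(DDL[[ii]], j = fn, value = as.character(DDL[[ii]][[fn]]))
  }
}
rbindlist(DDL)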

or (less memory-efficient)

rbindlist(rapply(DDL, classes = 'factor', f= as.character, how = 'replace'))

score 0 · Accepted Answer

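This bug is not fixed in R 4.0.2 with data.table 1.13.0. When I try to rbindlist() two DTs, one of which has a factor column while the other is empty, the column in the result is corrupted and so are the factor values (\n shows up at random; the levels are broken and NAs are introduced).
The workaround is not to rbindlist a DT together with an empty one, but only with other DTs that actually carry payload data. This does require some boilerplate code, though.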
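The boilerplate could look roughly like the sketch below. drop_empty is a hypothetical helper name, and plain data.frames stand in for the DTs so that the snippet runs without data.table; treat it as one possible shape of the workaround, not a tested fix for the bug.

```r
# Hypothetical helper: keep only the tables that actually contain rows
drop_empty <- function(tables) {
  Filter(function(d) nrow(d) > 0L, tables)
}

# Illustrative data: one table with a factor column and one empty table
full  <- data.frame(x = factor(c("a", "b")), y = 1:2)
empty <- full[0, ]

kept <- drop_empty(list(full, empty))
length(kept)  # only the non-empty table remains
```

With data.table loaded, the call would then be rbindlist(drop_empty(DDL)). If every table is empty, the filtered list is empty as well, so that case may need separate handling.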
