5

When rbinding two data.tables that contain ordered factors, the ordering seems to be lost:

library(data.table)

dtb1 = data.table(id = factor(c("a", "b"), levels = c("a", "c", "b"), ordered = TRUE), key = "id")
dtb2 = data.table(id = factor("c", levels = c("a", "c", "b"), ordered = TRUE), key = "id")
test = rbind(dtb1, dtb2)
is.ordered(test$id)
#[1] FALSE

Any thoughts or ideas?


3 Answers

7

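data.table does some fancy footwork so that data.table:::.rbind.data.table is called whenever rbind is called on objects that include data.tables. .rbind.data.table uses the speedups associated with rbindlist and does some extra checking, such as matching columns by name.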

.rbind.data.table handles factor columns by combining them with c (which is why the levels attribute is retained):

# the relevant code is
l = lapply(seq_along(allargs[[1L]]), function(i) do.call("c", 
    lapply(allargs, "[[", i)))

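Used this way, base R's c does not retain the "ordered" attribute; it doesn't even return a factor! (This was base R's behaviour when this answer was written; R 4.1.0 and later ship their own c method for factors.)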

For example (in base R):

f <- factor(1:2, levels = 2:1, ordered=TRUE)
g <- factor(1:2, levels = 2:1, ordered=TRUE)
# it isn't ordered!
is.ordered(c(f,g))
# [1] FALSE
# no surprise, as it isn't even a factor!
is.factor(c(f,g))
# [1] FALSE

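However, data.table has an S3 method, c.factor, which ensures that a factor is returned and that the levels are retained. Unfortunately, this method does not retain the ordered attribute.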

getAnywhere('c.factor')
# A single object matching ‘c.factor’ was found
# It was found in the following places
#   namespace:data.table
# with value
# 
# function (...) 
# {
#     args <- list(...)
#     for (i in seq_along(args)) if (!is.factor(args[[i]])) 
#         args[[i]] = as.factor(args[[i]])
#     newlevels = unique(unlist(lapply(args, levels), recursive = TRUE, 
#         use.names = TRUE))
#     ind <- fastorder(list(newlevels))
#     newlevels <- newlevels[ind]
#     nm <- names(unlist(args, recursive = TRUE, use.names = TRUE))
#     ans = unlist(lapply(args, function(x) {
#         m = match(levels(x), newlevels)
#         m[as.integer(x)]
#     }))
#     structure(ans, levels = newlevels, names = nm, class = "factor")
# }
# <bytecode: 0x073f7f70>
# <environment: namespace:data.table>

So yes, this is a bug. It is now reported as #5019.
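
In the meantime, one way around it is to combine the factors yourself. The following is only a minimal sketch in base R: combine_factors is a hypothetical helper name (it is not part of data.table or base R), and it assumes you want the ordered class kept only when every input is ordered and all inputs share identical levels.

```r
# Hypothetical helper, for illustration only (not a data.table function).
# Combines factors and keeps the "ordered" class when every input is an
# ordered factor and all inputs have identical levels.
combine_factors <- function(...) {
  args <- list(...)
  stopifnot(all(vapply(args, is.factor, logical(1))))

  level_sets  <- lapply(args, levels)
  same_levels <- length(unique(level_sets)) == 1L
  all_ordered <- all(vapply(args, is.ordered, logical(1)))

  # Union of the levels, in order of first appearance
  new_levels <- unique(unlist(level_sets))
  # Go through the labels so the integer codes are rebuilt consistently
  values <- unlist(lapply(args, as.character))

  factor(values, levels = new_levels, ordered = same_levels && all_ordered)
}

f <- factor(c("a", "b"), levels = c("a", "c", "b"), ordered = TRUE)
g <- factor("c", levels = c("a", "c", "b"), ordered = TRUE)

res <- combine_factors(f, g)
is.ordered(res)
# [1] TRUE
levels(res)
# [1] "a" "c" "b"
```

When the inputs are not all ordered, or their levels differ, the sketch falls back to a plain factor rather than guessing an order.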

Answered 2013-10-25T01:20:36.333
1

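As of version 1.8.11, data.table combines ordered factors into an ordered factor if a global order exists; if it does not, it complains and produces a plain factor instead: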

DT1 = data.table(ordered('a', levels = c('a','b','c')))
DT2 = data.table(ordered('a', levels = c('a','d','b')))

rbind(DT1, DT2)$V1
#[1] a a
#Levels: a < d < b < c

DT3 = data.table(ordered('a', levels = c('b','a','c')))
rbind(DT1, DT3)$V1
#[1] a a
#Levels: a b c
#Warning message:
#In rbindlist(lapply(seq_along(allargs), function(x) { :
#  ordered factor levels cannot be combined, going to convert to simple factor instead
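
One way to read "a global order exists" is that the two level vectors agree on the relative order of the levels they share. The sketch below checks that condition in base R. orders_compatible is a hypothetical helper written for illustration; it is an assumption about the idea, not a copy of how data.table implements the check.

```r
# Hypothetical helper: TRUE when two level orderings agree on the
# relative order of the levels they have in common.
orders_compatible <- function(lev1, lev2) {
  common <- intersect(lev1, lev2)
  identical(lev1[lev1 %in% common], lev2[lev2 %in% common])
}

# Levels of DT1 and DT2: "a" comes before "b" in both, so they can be merged
orders_compatible(c("a", "b", "c"), c("a", "d", "b"))
# [1] TRUE

# Levels of DT1 and DT3: "a" and "b" are swapped, so no common order exists
orders_compatible(c("a", "b", "c"), c("b", "a", "c"))
# [1] FALSE
```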

For comparison, here is what base R does:

rbind(data.frame(DT1), data.frame(DT2))$V1
#[1] a a
#Levels: a < b < c < d
# Notice that the resulting order does not respect the suborder for DT2

rbind(data.frame(DT1), data.frame(DT3))$V1
#[1] a a
#Levels: a < b < c
# Again, suborders are not respected and new order is created
Answered 2013-10-27T20:51:47.767
-1

I ran into the same problem after rbind; just reassign the ordered levels for the column.

test$id <- factor(test$id, levels = letters, ordered = T)

It is best to define the factor after the rbind.
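
Note that levels = letters imposes alphabetical order, which differs from the a < c < b order used in the question. A small sketch that reapplies the question's level order instead; here a plain factor stands in for test$id after the rbind, and the variable names lev and id are only illustrative.

```r
lev <- c("a", "c", "b")           # level order from the question
id  <- factor(c("a", "b", "c"))   # stand-in for test$id after rbind (plain factor)

# Rebuild the column as an ordered factor with the intended level order
id <- factor(id, levels = lev, ordered = TRUE)

is.ordered(id)
# [1] TRUE
levels(id)
# [1] "a" "c" "b"
id < "b"
# [1]  TRUE FALSE  TRUE
```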

Answered 2015-07-29T08:07:11.030