r - R: `split` preserving the natural order of the factor

Question

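`split` always orders its pieces lexicographically. In some situations one would rather keep the natural order, that is, the order in which the values appear. One can always write a function by hand, but is there a base R solution for this?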

Reproducible example:

Input:

  Date.of.Inclusion Securities.Included Securities.Excluded yearmon
1        2013-04-01          INDUSINDBK             SIEMENS  4 2013
2        2013-04-01                NMDC               WIPRO  4 2013
3        2012-09-28               LUPIN                SAIL  9 2012
4        2012-09-28          ULTRACEMCO                STER  9 2012
5        2012-04-27          ASIANPAINT                RCOM  4 2012
6        2012-04-27          BANKBARODA              RPOWER  4 2012

`split` output:

R> split(nifty.dat, nifty.dat$yearmon)
$`4 2012`
  Date.of.Inclusion Securities.Included Securities.Excluded yearmon
5        2012-04-27          ASIANPAINT                RCOM  4 2012
6        2012-04-27          BANKBARODA              RPOWER  4 2012

$`4 2013`
  Date.of.Inclusion Securities.Included Securities.Excluded yearmon
1        2013-04-01          INDUSINDBK             SIEMENS  4 2013
2        2013-04-01                NMDC               WIPRO  4 2013

$`9 2012`
  Date.of.Inclusion Securities.Included Securities.Excluded yearmon
3        2012-09-28               LUPIN                SAIL  9 2012
4        2012-09-28          ULTRACEMCO                STER  9 2012

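Note that `yearmon` is already sorted in the particular order I want. This can be taken as given, because the question would be slightly misspecified if it did not hold.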

Desired output:

$`4 2013`
  Date.of.Inclusion Securities.Included Securities.Excluded yearmon
1        2013-04-01          INDUSINDBK             SIEMENS  4 2013
2        2013-04-01                NMDC               WIPRO  4 2013

$`9 2012`
  Date.of.Inclusion Securities.Included Securities.Excluded yearmon
3        2012-09-28               LUPIN                SAIL  9 2012
4        2012-09-28          ULTRACEMCO                STER  9 2012

$`4 2012`
  Date.of.Inclusion Securities.Included Securities.Excluded yearmon
5        2012-04-27          ASIANPAINT                RCOM  4 2012
6        2012-04-27          BANKBARODA              RPOWER  4 2012

Thanks.

PS: I know there are better ways to create `yearmon` so that this order is preserved, but I am looking for a general solution.

score 25 · Accepted Answer

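`split` converts its `f` (second) argument to a factor if it is not one already, and a factor built that way gets its levels in sorted order. So, if you want the order preserved, convert the column to a factor yourself, with the levels in the desired order. That is: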

df$yearmon <- factor(df$yearmon, levels=unique(df$yearmon))
# now split
split(df, df$yearmon)
# $`4_2013`
#   Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 1        2013-04-01          INDUSINDBK             SIEMENS  4_2013
# 2        2013-04-01                NMDC               WIPRO  4_2013

# $`9_2012`
#   Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 3        2012-09-28               LUPIN                SAIL  9_2012
# 4        2012-09-28          ULTRACEMCO                STER  9_2012

# $`4_2012`
#   Date.of.Inclusion Securities.Included Securities.Excluded yearmon
# 5        2012-04-27          ASIANPAINT                RCOM  4_2012
# 6        2012-04-27          BANKBARODA              RPOWER  4_2012
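For comparison, here is a minimal self-contained sketch of the same idea that passes the factor directly to `split` instead of overwriting the column. The data frame `nifty` is a hypothetical miniature of the question's data (only two columns are kept), so treat it as an illustration rather than the original dataset.

```r
# Hypothetical miniature of the question's data; only yearmon matters here.
nifty <- data.frame(
  Securities.Included = c("INDUSINDBK", "NMDC", "LUPIN",
                          "ULTRACEMCO", "ASIANPAINT", "BANKBARODA"),
  yearmon = c("4 2013", "4 2013", "9 2012", "9 2012", "4 2012", "4 2012"),
  stringsAsFactors = FALSE
)

# Default behaviour: split() builds the factor itself, so the levels are sorted.
by_default <- split(nifty, nifty$yearmon)

# Supplying a factor whose levels follow the order of first appearance
# keeps that order; the data frame itself is left unmodified.
by_appearance <- split(nifty,
                       factor(nifty$yearmon, levels = unique(nifty$yearmon)))

names(by_default)     # "4 2012" "4 2013" "9 2012"
names(by_appearance)  # "4 2013" "9 2012" "4 2012"
```

Because the factor is created on the fly, `nifty$yearmon` stays a character column, which may be preferable when the column is reused elsewhere.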

But don't use `split`. Use `data.table` instead:

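In general, however, `split` tends to become very slow as the number of levels grows. So I would suggest using `data.table` to subset into a list instead. I expect that to be much faster!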

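require(data.table)
dt <- data.table(df)
# number the groups in order of first appearance of each yearmon value
dt[, grp := .GRP, by = yearmon]
# sort by group number (appearance order, not alphabetical order)
setkey(dt, grp)
# collect each group's rows (.SD) into one list element
o2 <- dt[, list(list(.SD)), by = grp]$V1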

Benchmarking on large data:

set.seed(45)
dates <- seq(as.Date("1900-01-01"), as.Date("2013-12-31"), by = "days")
ym <- do.call(paste, c(expand.grid(1:500, 1900:2013), sep="_"))

df <- data.frame(x1 = sample(dates, 1e4, TRUE),
                 x2 = sample(letters, 1e4, TRUE),
                 x3 = sample(10, 1e4, TRUE),
                 yearmon = sample(ym, 1e4, TRUE),
                 stringsAsFactors = FALSE)

require(data.table)
dt <- data.table(df)

f1 <- function(dt) {
    dt[, grp := .GRP, by = yearmon]
    setkey(dt, grp)

    o1 <- dt[, list(list(.SD)), by=grp]$V1
}

f2 <- function(df) {
    df$yearmon <- factor(df$yearmon, levels=unique(df$yearmon))
    o2 <- split(df, df$yearmon)
}

require(microbenchmark)
microbenchmark(o1 <- f1(dt), o2 <- f2(df), times = 10)

# Unit: milliseconds
#          expr        min         lq     median        uq      max neval
#  o1 <- f1(dt)   43.72995   43.85035   45.20087  715.1292 1071.976    10
#  o2 <- f2(df) 4485.34205 4916.13633 5210.88376 5763.1667 6912.741    10

Note that the result in `o1` will be an unnamed list. But you can simply set the names with `names(o1) <- unique(dt$yearmon)`.
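As a rough sketch of that naming step on a tiny, hypothetical table (the `value` column is invented for illustration): after `setkey(dt, grp)` the rows are ordered by group number, so `unique(dt$yearmon)` lines up with the list elements. The last lines also show the `split` method that reasonably recent `data.table` releases provide for `data.table` objects, which, assuming its `by` argument and default `sorted = FALSE` are available in your installed version, returns a named list in order of appearance directly.

```r
library(data.table)

# Hypothetical toy data in the question's order of appearance.
dt <- data.table(
  yearmon = c("4 2013", "4 2013", "9 2012", "9 2012", "4 2012", "4 2012"),
  value   = 1:6
)

dt[, grp := .GRP, by = yearmon]   # group ids in order of first appearance
setkey(dt, grp)                   # rows are now ordered by group id

o1 <- dt[, list(list(.SD)), by = grp]$V1
names(o1) <- unique(dt$yearmon)   # same order as the groups above
names(o1)

# Alternative, assuming a data.table version that ships split.data.table:
# groups are returned in order of appearance because sorted = FALSE by default.
o3 <- split(dt, by = "yearmon")
names(o3)
```

Both `names()` calls should give the appearance order of `yearmon`, not the alphabetical one.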
