r - How to vectorize and speed up strtime() log-time conversion on a dataframe

Question

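(EDIT: One of the issues here is scale: what works for one row will blow up or crash R on a 200,000 * 50 dataframe. For example, strptime must be applied column-wise, not row-wise, to avoid hanging. I'm looking for working-code solutions that you actually ran on 200,000 * 50, including your measured runtime, not just casual "this is easy" remarks. It's easy to get runtimes > 12 hours if you pick the wrong fn. I also asked for my zero-time adjustment code to be made faster; the work isn't finished until that's finished. No one has attempted that so far.)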

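I want to vectorize and speed up the following multi-step log-time conversion, with millisecond accuracy. It involves converting strtime() output to a single numeric, then a subtraction, then log(), on a large dataframe (200,000 rows * 300 columns; the other (non-time) columns are omitted). Code is below. As well as making it vectorized and fast, an extra problem is that I'm not sure how best to represent the (higher-dimensional) intermediate values at each step (e.g. as a list from strtime, a matrix, or a vector). I have already tried apply, sapply, lapply, vapply, plyr::maply(), ..., but the incompatibility of the intermediate formats keeps tripping me up...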

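Each row has 50 columns time1..time50 (chr, format="HH:MM:SS.sss") representing times as strings at millisecond resolution. I need millisecond accuracy. Within each row, columns time1..time50 are in non-decreasing order, and I want to convert them to the log of the time before time50. The conversion fn parse_hhmmsecms() is at the bottom and needs serious vectorization and speed-up; you can see the alternative versions commented out. What I have figured out so far: strtime() is faster than (multiple) substr() calls. I then somehow convert to a list of three numerics (hh, mm, sec.ms), then convert to a vector, on the assumption that the next step should be a vector multiply with %*% c(3600,60,1) to get numeric seconds. Here is pseudocode for what I do for each row and each time string; the full code is at the bottom: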

 for each row in dataframe { # vectorize this, loop_apply(), or whatever...
   #for each time-column index i ('time1'..'time50') { # vectorize this...
     hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
     # Main computation
     xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))
     # Minor task: fix up all the 'zero-time' events to be evenly spaced between -3..0
   #}
 }

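So there are five subproblems involved:

1. How do I vectorize handling of the list that strtime() returns? It returns a list of 3 items, so passing a 2D dataframe or a 1D row of time strings yields a 3D or 2D intermediate object. (Do we internally use a list of lists? A matrix of lists? An array of lists?)
2. How do I vectorize the entire function parse_hhmmsecms()?
3. Then do the subtraction and take the log.
4. Vectorize the zero-time fix-up code (by far the slowest part at the moment).
5. How do I speed up steps 1...4?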

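The code snippet below uses ten sample columns, time41..50 (use random_hhmmsecms() if you want a bigger sample).

I did my best to follow these recommendations; this is as reproducible as I could make it in six hours of work: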

# Each of 200,000 rows has 50 time strings (chr) like this...    
xx <- structure(list(time41 = c("08:00:41.465", "08:00:50.573", "08:00:50.684"
), time42 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"), 
    time43 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"
    ), time44 = c("08:00:41.465", "08:00:50.664", "08:00:50.759"
    ), time45 = c("08:00:41.465", "08:00:50.684", "08:00:50.759"
    ), time46 = c("08:00:42.496", "08:00:50.684", "08:00:50.759"
    ), time47 = c("08:00:42.564", "08:00:50.759", "08:00:51.373"
    ), time48 = c("08:00:48.370", "08:00:50.759", "08:00:51.373"
    ), time49 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
    ), time50 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
    )), .Names = c("time41", "time42", "time43", "time44", "time45", 
"time46", "time47", "time48", "time49", "time50"), row.names = 3:5, class = "data.frame")

# Handle millisecond timing and time conversion
options('digits.secs'=3)

# Parse "HH:MM:SS.sss" timestring into (numeric) number of seconds (Very slow)
parse_hhmmsecms <- function(t) {
  as.numeric(substr(t,1,2))*3600 + as.numeric(substr(t,4,5))*60 + as.numeric(substr(t,7,12)) # WORKS, V SLOW

  #c(3600,60,1) %*% sapply((strsplit(t[1,]$time1, ':')), as.numeric) # SLOW, NOT VECTOR

  #as.vector(as.numeric(unlist(strsplit(t,':',fixed=TRUE)))) %*% c(3600,60,1) # WANT TO VECTORIZE THIS
}

random_hhmmsecms <- function(n=1, min=8*3600, max=16*3600) {
# Generate n random hhmmsecms objects between min and max (8am:4pm)
xx <- runif(n,min,max)
ss <- xx %%  60
mm <- (xx %/% 60) %% 60
hh <- xx %/% 3600
sprintf("%02d:%02d:%06.3f", hh,mm,ss)  # width 6 so seconds below 10 are zero-padded ("SS.sss")
}

xx$logtime45 <- xx$logtime44 <- xx$logtime43 <- xx$logtime42  <- xx$logtime41  <- NA
xx$logtime50 <- xx$logtime49 <- xx$logtime48 <- xx$logtime47  <- xx$logtime46  <- NA

# (we pass index vectors as the dataframe column ordering may change) 
Ctime <- which(colnames(xx)=='time41') : which(colnames(xx)=='time50')
Clogtime <- which(colnames(xx)=='logtime41') : which(colnames(xx)=='logtime50')
for (i in seq_len(nrow(xx))) {
  #if (i%%100==0) { print(paste('... row',i)) }

  hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
  xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))

  # Now fix up all the 'zero-time' events to be evenly spaced between -3..0
  Czerotime.p <- which(xx[i,Clogtime]==Inf | xx[i,Clogtime]>-1e-9)
  xx[i,Clogtime[Czerotime.p]] <- seq(-3,0,length.out=length(Czerotime.p))  # Czerotime.p indexes within Clogtime
}

score 2 · Accepted Answer

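You may be overcomplicating things.

Start with the base classes, which handle milliseconds very well (and, on appropriate operating systems, even microseconds), but note that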

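you need to set options("digits.secs"=6) (the documented maximum that can be displayed) to see them displayed
you need an additional parsing character (%OS rather than %S) for strptime et al.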

All of this is in the docs, and in numerous examples here on SO.

Quick examples:

R> someTime <- ISOdatetime(2011, 12, 27, 2, 3, 4.567)
R> someTime
[1] "2011-12-27 02:03:04.567 CST"
R> now <- Sys.time()
R> now
[1] "2011-12-27 16:48:20.247298 CST"      # microsecond display on Linux
R> 
R> txt <- "2001-02-03 04:05:06.789123"
R> strptime(txt, "%Y-%m-%d %H:%M:%OS")    # note the %OS (letter O) for sub-seconds
[1] "2001-02-03 04:05:06.789123"
R>

And key functions such as strptime or as.POSIXct are all vectorized, so you can throw entire columns at them.
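As a rough sketch of how that column-wise advice might be applied to the question's data (this is not code from the answer, and no timings were measured for it): each time column is parsed with a single strptime() call, the POSIXlt fields are combined into seconds, and the log-time and zero-time steps are done on the resulting matrix. The names to_seconds, time_cols, secs, ms_before, rnk and the three-column sample are hypothetical; rounding to whole milliseconds is an added assumption, and the zero-time threshold is copied from the question's loop.

```r
# Sketch only: to_seconds(), time_cols and the 3 x 3 sample are hypothetical.
options(digits.secs = 3)

xx <- data.frame(
  time48 = c("08:00:48.370", "08:00:50.759", "08:00:51.373"),
  time49 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"),
  time50 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"),
  stringsAsFactors = FALSE
)
time_cols <- c("time48", "time49", "time50")

# Parse a whole column of "HH:MM:SS.sss" strings in one strptime() call.
# The result is POSIXlt; its hour/min/sec fields are plain vectors, and
# sec keeps the fractional part because the format uses %OS.
to_seconds <- function(x) {
  lt <- strptime(x, "%H:%M:%OS", tz = "UTC")
  lt$hour * 3600 + lt$min * 60 + lt$sec
}

# One call per column (not per row): an nrow x ncol numeric matrix.
secs <- vapply(xx[time_cols], to_seconds, numeric(nrow(xx)))

# Milliseconds before the last time column. The vector recycles down the
# columns, so each row is subtracted from its own final time.
# Assumption: inputs have millisecond resolution, so rounding is safe.
ms_before <- round(1000 * (secs[, ncol(secs)] - secs))
logtime <- -10 * log10(ms_before)          # zero gaps become Inf

# Zero-time fix-up without a row loop (same condition as the question).
zero <- is.infinite(logtime) | logtime > -1e-9
k <- rowSums(zero)                         # flagged entries per row

# Running count of flagged entries along each row, built column by column.
rnk <- matrix(0L, nrow(zero), ncol(zero))
run <- integer(nrow(zero))
for (j in seq_len(ncol(zero))) {
  run <- run + zero[, j]
  rnk[, j] <- run
}

# Evenly spaced values from -3 to 0 within each row; a single flagged
# entry gets -3, matching seq(-3, 0, length.out = 1).
step <- ifelse(k > 1, 3 / (k - 1), 0)
fix <- -3 + (rnk - 1) * step
logtime[zero] <- fix[zero]
```

On this sample, the second row (three identical times) ends up as -3, -1.5, 0. The only loop left runs over columns (50 in the question), so each iteration works on a full-length vector; whether that is fast enough on 200,000 rows would still need to be measured.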
