r - Using parse_date_time to parse dates in dmy format together with dmY

Question

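I have a vector of character representations of dates, where the formats are mostly dmY (e.g. 27-09-2013) and dmy (e.g. 27-09-13), with occasional b or B months. Thus parse_date_time in the lubridate package, which "allows the user to specify several format-orders to handle heterogeneous date-time character representations", could be a very useful function for me.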

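However, parse_date_time seems to have a problem parsing dmy dates when they occur together with dmY dates. When parsing dmy alone, or together with some other formats relevant to me, it works fine. This pattern was also noted in a comment on @Peyton's answer. A quick fix was suggested there, but I would like to ask whether it can be handled within lubridate.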

Here I show some examples where I try to parse dates in dmy format together with some other formats, specifying orders accordingly.

library(lubridate)
# version: lubridate_1.3.0

# regarding how date format is specified in 'orders':
# examples in ?parse_date_time
# parse_date_time(x, "ymd")
# parse_date_time(x, "%y%m%d")
# parse_date_time(x, "%y %m %d")
# these order strings are equivalent and parse the same way
# "Formatting orders might include arbitrary separators. These are discarded"

# dmy date only
parse_date_time(x = "27-09-13", orders = "d m y")
# [1] "2013-09-27 UTC"
# OK

# dmy & dBY
parse_date_time(c("27-09-13", "27 September 2013"), orders = c("d m y", "d B Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK

# dmy & dbY
parse_date_time(c("27-09-13", "27 Sep 2013"), orders = c("d m y", "d b Y"))
# [1] "2013-09-27 UTC" "2013-09-27 UTC"
# OK

# dmy & dmY
parse_date_time(c("27-09-13", "27-09-2013"), orders = c("d m y", "d m Y"))
# [1] "0013-09-27 UTC" "2013-09-27 UTC"
# not OK

# does order of the date components matter?
parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Y m d", "y m d"))
# [1] "2013-09-27 UTC" "0013-09-27 UTC"
# no
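
The 0013 results are what one would expect if the two-digit year were read through a %Y format. A minimal base R sketch of that difference follows; it uses as.Date rather than lubridate, so it only illustrates the format semantics and makes no claim about parse_date_time's internals:

```r
# Base R only; this does not call lubridate.
x <- "27-09-13"

# %y: two-digit year, expanded to a four-digit year (13 becomes 2013)
d_y <- as.Date(x, format = "%d-%m-%y")

# %Y: year with century, so "13" is taken literally as year 13
d_Y <- as.Date(x, format = "%d-%m-%Y")

# Extract the numeric year of each result for comparison
year_y <- as.integer(format(d_y, "%Y"))
year_Y <- as.integer(format(d_Y, "%Y"))
print(c(year_y = year_y, year_Y = year_Y))
```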

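What about the select_formats argument? I am sorry to say this, but I have a hard time understanding this section of the help file. And a search for select_formats on SO gave 0 results. Still, this part seems relevant: "By default the formats with most formatting tokens (%) are selected and %Y counts as 2.5 tokens (so that it can have priority over %y%m)." So I (desperately) tried with some additional dmy dates: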

parse_date_time(c("27-09-2013", rep("27-09-13", 10)), orders = c("d m y", "d m Y"))
# not OK. Tried also 100 dmy dates.

# does order in the vector matter?
parse_date_time(c(rep("27-09-13", 10), "27-09-2013"), orders = c("d m y", "d m Y"))
# no
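
The default rule quoted from the help file can be sketched in base R to see how the two candidate formats would be scored. The helper name score_format is hypothetical, and the sketch assumes the rule is simply one point per % token plus 1.5 extra for %Y:

```r
# Hypothetical helper mirroring the rule quoted from the help file:
# one point per "%" token, plus 1.5 extra when the format contains %Y,
# so that %Y counts as 2.5 tokens in total.
score_format <- function(fmt) {
  n_tokens <- nchar(gsub("[^%]", "", fmt))  # number of "%" characters
  n_tokens + grepl("%Y", fmt, fixed = TRUE) * 1.5
}

fmts <- c("%d-%m-%y", "%d-%m-%Y")
scores <- score_format(fmts)
names(scores) <- fmts
print(scores)
```

If selection depends only on these scores, the %Y format outscores the %y format regardless of how many dmy strings the vector contains, which would be consistent with the results above.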

Then I checked how the guess_formats function (also in lubridate) handles dmy together with dmY:

guess_formats(c("27-09-13", "27-09-2013"), c("dmy", "dmY"), print_matches = TRUE)
#                   dmy        dmY       
# [1,] "27-09-13"   "%d-%m-%y" ""        
# [2,] "27-09-2013" "%d-%m-%Y" "%d-%m-%Y"
# OK

From ?guess_formats: y also matches Y. From ?parse_date_time: y* Year without century (00–99 or 0–99). Also matches year with century (Y format). So I tried:

guess_formats(c("27-09-13", "27-09-2013"), c("dmy"), print_matches = TRUE)
#                   dmy       
# [1,] "27-09-13"   "%d-%m-%y"
# [2,] "27-09-2013" "%d-%m-%Y"
# OK

Thus, guess_formats seems to be able to handle dmy together with dmY. But how can I tell parse_date_time to do the same? Thanks in advance for any comments or help.
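
For comparison, a stopgap outside lubridate is to pick the format per element from the number of year digits. This is only a base R sketch under the assumption that every string is day-month-year separated by hyphens; it is not the quick fix mentioned above and not how parse_date_time works:

```r
# Base R only; assumes "dd-mm-yy" or "dd-mm-yyyy" strings.
x <- c("27-09-13", "27-09-2013", "01-02-03")

# TRUE where the string ends in a four-digit year
four_digit_year <- grepl("-[0-9]{4}$", x)

# Choose %Y for four-digit years and %y for two-digit years
fmt <- ifelse(four_digit_year, "%d-%m-%Y", "%d-%m-%y")

# strptime-style formats are applied element by element
parsed <- as.Date(x, format = fmt)
print(parsed)
```

Note that %y decides the century by a fixed rule, so this sketch inherits the ambiguity about which century a two-digit year belongs to.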

Update: I posted the question on the lubridate bug tracker and got a swift reply from @vitoshka: "This is a bug".

score 3 · Accepted Answer

It looks like a bug. I am not sure, so you should contact the maintainer.

Building the package from source and changing one line in this internal function (I replaced which.max with which.min):

.select_formats <-   function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%Y", names(trained))*1.5
  names(trained[ which.min(n_fmts) ]) ## replace which.max  by which.min
}

seems to correct the problem. Frankly, I don't know why this works, but I guess it is some kind of ranking.

parse_date_time(c("27-09-13", "27-09-2013"), orders = c("d m y", "d m Y"))
[1] "2013-09-27 UTC" "2013-09-27 UTC"

parse_date_time(c("2013-09-27", "13-09-13"), orders = c("Y m d", "y m d"))
[1] "2013-09-27 UTC" "2013-09-13 UTC"

score 1 · Accepted Answer

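This is actually intentional. I remember now. The assumption is that if you have dates of the form 01-02-1845 and 01-02-03 in the same vector, then the latter probably means 01-02-0003. It also avoids confusion between dates in different centuries: you don't know whether 17-05-13 refers to the 20th or the 21st century.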

There might also be a technical reason for this decision, but I don't recall it now.

The select_formats argument is the way to go:

my_select <-   function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) +
    grepl("%y", names(trained))*1.5
  names(trained[ which.max(n_fmts) ])
}

parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## [1] "2013-09-27 UTC" "2013-09-27 UTC"

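select_formats should return the formats to be applied sequentially to the input character vector. In the example above, you give priority to the %y format.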
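
To see what changing the weighted token does, the selector can be run on a mock input in base R. The trained vector below is made up for illustration (only its names matter here), and select_by_weight is a hypothetical helper that generalizes the weighting in my_select; the real object passed by parse_date_time may differ:

```r
# Mock of the 'trained' argument: a named vector whose names are the
# candidate formats. The values are invented for this illustration.
trained <- c("%d-%m-%y" = 1L, "%d-%m-%Y" = 1L)

# Hypothetical helper: count "%" tokens and add 1.5 to formats that
# contain the favoured token, then return the top-scoring format name.
select_by_weight <- function(trained, favoured) {
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) +
    grepl(favoured, names(trained), fixed = TRUE) * 1.5
  names(trained[which.max(n_fmts)])
}

picked_when_Y_favoured <- select_by_weight(trained, "%Y")
picked_when_y_favoured <- select_by_weight(trained, "%y")
print(c(picked_when_Y_favoured, picked_when_y_favoured))
```

Favouring %y, as my_select does, makes the two-digit-year format win on this mock input, while favouring %Y picks the four-digit-year format.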

I am adding this example to the docs.
