r - How do I pass a single additional argument to disk.frame's inmapfn at read-in time?

Question

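According to the article https://diskframe.com/articles/ingesting-data.html, a good use case for the inmapfn argument of csv_to_disk.frame(...) is date conversion. In my data, the name of the date column is only known at runtime, and I would like to pass that name into the function that converts the dates at read time. The problem I am running into is that there does not seem to be a way to pass anything into inmapfn other than the chunk itself. I cannot hard-code the column name, because it is not known until runtime.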

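To clarify: inmapfn appears to run in its own environment, presumably to prevent data races and other parallelization problems. However, I know the variable will not change, so I am hoping there is a way to override this behavior, since I can guarantee that doing so is safe.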

I know that the function I am calling works when it is called on an arbitrary data frame.

I have provided a reproducible example below.

library(tidyverse)
library(disk.frame)

setup_disk.frame()

a <- tribble(~dates, ~val,
             "09feb2021", 2,
             "21feb2012", 2,
             "09mar2013", 3,
             "20apr2021", 4,
)

write_csv(a, "a.csv")

dates_col <- "dates"

tmp.df <- csv_to_disk.frame(
  "a.csv",
  outdir = file.path(tempdir(), "tmp.df"),
  in_chunk_size = 1L, 
  inmapfn = function(chunk) {
    chunk[, sdate := as.Date(do.call(`$`, list(chunk,dates_col)), "%d%b%Y")]
  }
)
#>  -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#>  -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#>  -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#>  -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = `  to set column types to minimize the chance of a failed read
#> =================================================
#> 
#>  -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#> 
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#> 
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found

score 2 · Accepted Answer

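You can try different values for the backend and chunk_reader arguments. For example, if you set backend to "readr", the user-defined inmapfn function will have access to previously defined variables. In addition, readr guesses column types and will automatically convert a column to a date type if it recognizes the string format as a date (in your sample data, however, it will not recognize the column as a date).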

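If you do not want to use the readr backend for performance reasons, then I would ask whether your example accurately represents your real situation. In the example you provided, I do not see a need to pass the date column as a variable.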

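There is a working solution in the just-in-time transformation section of the page you linked, and I do not see any additional complexity in your example compared with that one.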

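If you really do need the default backend and chunk_reader, and you really do need to pass a previously defined variable to the inmapfn function, you can wrap the csv_to_disk.frame call in a wrapper function: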

library(tibble)      # provides tribble()
library(disk.frame)

setup_disk.frame()

df <- tribble(~dates, ~val,
              "09feb2021", 2,
              "21feb2012", 2,
              "09mar2013", 3,
              "20apr2021", 4,
)

write.csv(df, file.path(tempdir(), "df.csv"), row.names = FALSE)

wrap_csv_to_disk <- function(col) {
  
  my_date_col <- col
  
  csv_to_disk.frame(
    file.path(tempdir(), "df.csv"), 
    in_chunk_size = 1L,
    inmapfn = function(chunk, dates = my_date_col) {
      chunk[, dates] <- lubridate::dmy(chunk[[dates]])
      chunk
    })
}

date_col <- "dates"

df_disk_frame <- wrap_csv_to_disk(date_col)

str(collect(df_disk_frame)$dates)
#>  Date[1:4], format: "2021-02-09" "2012-02-21" "2013-03-09" "2021-04-20"
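
The wrapper above works because the column name lives in the environment of the function that is handed to inmapfn, rather than in the global environment. The same idea can be sketched in base R as a function factory. This is only an illustration: make_date_parser is a hypothetical helper name, not part of disk.frame, and the chunk here is a plain data.frame standing in for whatever disk.frame passes to inmapfn. Note that "%b" depends on the session's locale, so the sketch assumes English month abbreviations.

```r
# Hypothetical helper (not part of disk.frame): returns a function of a single
# `chunk` argument with the column name and date format captured in its closure.
make_date_parser <- function(col, fmt = "%d%b%Y") {
  force(col)  # evaluate now so the closure does not depend on the caller later
  force(fmt)
  function(chunk) {
    chunk[[col]] <- as.Date(chunk[[col]], format = fmt)
    chunk
  }
}

# A small stand-in for one chunk of the CSV.
chunk <- data.frame(
  dates = c("09feb2021", "21feb2012"),
  val = c(2, 2),
  stringsAsFactors = FALSE
)

date_col <- "dates"                    # only known at runtime
parse_dates <- make_date_parser(date_col)
rm(date_col)                           # the closure no longer needs the global

parsed <- parse_dates(chunk)
str(parsed$dates)
```

Under that assumption, the returned function could be passed as inmapfn = make_date_parser(col) in place of the inline function with a default argument; whether that behaves identically under every backend has not been checked here.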

score 0 · Accepted Answer

I see. As a workaround, would something like this be possible?

date_var = known_at_runtime()  # placeholder for however the name is obtained
saveRDS(date_var, "some/path/date_var.rds")

a = csv_to_disk.frame(files, inmapfn = function(chunk) {
   date_var = readRDS("some/path/date_var.rds")
   # do the rest
})

I think letting inmapfn accept additional options is feasible; see https://github.com/xiaodaigh/disk.frame/issues/377 to track this.
