
I'm trying to set up a workflow that involves downloading a zip file, extracting its contents, and applying a function to each of its files.

I'm running into a few problems:

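  1. How do I set up an empty file system reproducibly? That is, I'd like to create a set of empty directories and download files into them later. Ideally, I'd do something like tar_target(my_dir, fs::dir_create("data"), format = "file"), but I know from the docs that empty directories cannot be used with format = "file". I know I could call dir_create at every place that needs the directory, but that seems clumsy.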

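  2. In the reprex below, I'd like to operate on each file using pattern = map(x). As the error suggests, I'd need to specify a pattern for the parent target, since it has format = "file". You can see that if I did specify a pattern for the parent target, I would then need to do the same for its parent. As far as I know, a pattern cannot be set for a target that has no parents (but I have been wrong many times before).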

I have a feeling I'm going about this all wrong. Thank you for your time.

library(targets)
tar_script({
    tar_option_set(packages = c("tidyverse", "fs"))
    download_file <- function(url, dest) {
        download.file(url, dest)
        dest
    }
    do_stuff <- function(file_path) {
        fs::file_copy(file_path, file_path, overwrite = TRUE)
    }
    list(
      tar_target(downloaded_zip, 
                 download_file("https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip", 
                               path(dir_create("data"), "file", ext = "zip")), 
                 format = "file"), 
 
      tar_target(extracted_files, 
                 unzip(downloaded_zip, exdir = dir_create("data")), 
                 format = "file"), 

      tar_target(stuff_done, 
                 do_stuff(extracted_files), 
                 pattern = map(extracted_files), format = "file", 
                 iteration = "list"))
})
tar_make()
#> * start target downloaded_zip
#> trying URL 'https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip'
#> Content type 'application/zip' length 2036861 bytes (1.9 MB)
#> ==================================================
#> downloaded 1.9 MB
#> 
#> * built target downloaded_zip
#> * start target extracted_files
#> * built target extracted_files
#> * end pipeline
#> Error : Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Error: callr subprocess failed: Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

Created on 2021-12-08 by the reprex package (v2.0.1)


1 Answer


Original answer

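Here's an idea: you could track the URL with format = "url" and then make that URL a dependency of all the file branches. Below, every branch of files should rerun when the upstream online data changes. That's fine, because all those branches do is re-hash files. However, not every branch of stuff_done should run if only some of the files actually changed.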

Edit

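On second thought, we probably need to hash the local files in bulk. It's not the most efficient approach, but it gets the job done. targets wants you to use its own built-in storage system rather than external files, so if you can read the data in and return it in a non-file format, dynamic branching becomes easier.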

# _targets.R file
library(targets)
tar_option_set(packages = c("tidyverse", "fs"))
download_file <- function(url, dest) {
  download.file(url, dest)
  dest
}
do_stuff <- function(file_path) {
  file.info(file_path)
}
download_and_unzip <- function(url) {
  downloaded_zip <- tempfile()
  download_file(url, downloaded_zip)
  unzip(downloaded_zip, exdir = dir_create("data"))
}
list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  tar_target(
    files_bulk,
    download_and_unzip(url),
    format = "file"
  ),
  tar_target(file_names, files_bulk), # not a format = "file" target
  tar_target(
    files, {
      files_bulk # Re-hash all the files separately if any file changes.
      file_names
    },
    pattern = map(file_names),
    format = "file"
  ),
  tar_target(stuff_done, do_stuff(files), pattern = map(files))
)
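
For comparison, the non-file route mentioned in the edit might look roughly like the sketch below. It is only an illustration under some assumptions: read_zip_members is a hypothetical helper (not part of targets or of the code above), each archive member is small enough to keep in the targets data store as a raw vector, and length(contents) stands in for whatever do_stuff would really compute. Because nothing here uses format = "file", stuff_done can branch over contents directly.

```r
# _targets.R file (sketch of the non-file alternative)
library(targets)

# Hypothetical helper: download the archive to a temporary file, unzip it
# into a temporary directory, and return each member's bytes as one element
# of a named list. Only the returned list is tracked by targets.
read_zip_members <- function(url) {
  zip_path <- tempfile(fileext = ".zip")
  download.file(url, zip_path, mode = "wb") # binary mode keeps the zip intact
  paths <- unzip(zip_path, exdir = tempfile())
  paths <- paths[!dir.exists(paths)] # keep regular files only
  contents <- lapply(paths, function(p) {
    readBin(p, what = "raw", n = file.size(p))
  })
  names(contents) <- basename(paths)
  contents
}

list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  # iteration = "list" so each branch receives a single raw vector.
  tar_target(contents, read_zip_members(url), iteration = "list"),
  # Placeholder computation: the size in bytes of each archive member.
  tar_target(stuff_done, length(contents), pattern = map(contents))
)
```

The trade-off is that the file contents are copied into the _targets data store, so this is probably only attractive for modest archives. For large files, the format = "file" pipeline above remains the more economical choice.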
answered 2021-12-09T16:42:33.143