
I use the {targets} package in a data analysis project in which I need to fetch datasets from a remote Oracle database.

My primary task is twofold:

  1. Fetch data from several different view tables scattered across the Oracle DB.
  2. Compute and wrangle new data columns based on the data I pull from Oracle.
  • Although these computed columns come from different database tables, they do have enough in common that I can join them by some relevant index columns.

Each new column I compute is a "special snowflake", so I've built a dedicated wrangling function for each one, plus a higher-order function that calls each snowflake's wrangling sub-function.

My problem concerns how to do this with targets::tar_target().


Reproducible example

To convey my problem accurately, I unfortunately need to include quite a lot of code in this example. The first part only generates demo data and mimics the Oracle DB. You can just run that code and skip to the section that follows it.

1. Simulate the data: set up a database with 4 tables
Just run this code; it is not important for understanding the problem.

library(dplyr, warn.conflicts = FALSE)
library(babynames)
library(DBI)
library(RSQLite)

set.seed(2021)

simulate_df_from_colnames <- function(vec_of_colnames, desired_nrows, vec_of_ids) {
  
  stopifnot(desired_nrows == length(vec_of_ids))
  
  ncols        <- length(vec_of_colnames)
  n_values     <- ncols * desired_nrows
  vec          <- runif(n = n_values, min = 1, max = 100)
  vec[sample(1:length(vec), 0.2 * length(vec))] <- NA # sprinkle NA randomly in 20% of values
  mat          <- matrix(vec, ncol = ncols)
  df           <- as.data.frame(mat)
  colnames(df) <- vec_of_colnames
  df$id        <- vec_of_ids
  df           <- df[,c(ncol(df),1:(ncol(df)-1))] # move the id column from last to first position

  return(df)
}

work_related <- c("acceptance", "accountability", "achievement", "adaptability", "adventure", "authenticity", "authority", "autonomy", "balance", "boldness", "bravery", "candor", "challenge", "clarity", "collaboration", "compassion", "communication", "community", "contribution", "creativity", "curiosity", "dependability", "determination", "diversity", "empathy", "enthusiasm", "equality", "family", "fairness", "flexibility", "friendship", "growth", "happiness", "hard_work", "honesty", "humility", "humor", "impact", "improvement", "ingenuity", "innovation", "kindness", "knowledge", "leadership", "learning", "loyalty", "meaningful_work", "optimism", "ownership", "participation", "patience", "peace", "persistence", "popularity", "power", "quality", "recognition", "relationships", "reliability", "reputation", "respect", "responsibility", "results", "security", "self_improvement", "simplicity", "spirituality", "stability", "success", "sustainability", "teamwork", "tenacity", "time_management", "transparency", "trustworthiness", "wealth", "wisdom", "work_ethic", "work_life_balance")
blood_tests  <- c("white_blood_cell_count", "red_blood_cell_count", "hemoglobin", "hematocrit", "mean_corpuscular_volume", "platelet_count", "sodium", "potassium", "chloride", "carbon_dioxide", "blood_urea_nitrogen", "creatinine", "glucose", "calcium", "total_protein", "albumin", "bilirubin", "alkaline_phosphatase", "ast", "alt", "vitamin_b_12", "methylmalonic_acid", "ferritin")
physical     <- c("systolic_blood_pressure", "diastolic_blood_pressure", "pulse_rate_beats_minute", "height", "weight", "bmi", "waist_circumference", "hip_circumference")
psych_traits <- c("accessible", "active", "adaptable", "admirable", "adventurous", "agreeable", "alert", "allocentric", "amiable", "anticipative", "appreciative", "articulate", "aspiring", "athletic", "attractive", "balanced", "benevolent", "brilliant", "calm", "capable", "captivating", "caring", "challenging", "charismatic", "charming", "cheerful", "clean", "clear_headed", "clever", "colorful", "companionly", "compassionate", "conciliatory", "confident", "conscientious", "abrasive", "abrupt", "agonizing", "aimless", "airy", "aloof", "amoral", "angry", "anxious", "apathetic", "arbitrary", "argumentative", "arrogantt", "artificial", "asocial", "assertive", "astigmatic", "barbaric", "bewildered", "bizarre", "bland", "blunt", "biosterous", "brittle", "brutal", "calculating", "callous", "cantakerous", "careless", "cautious", "charmless", "childish", "clumsy", "coarse", "cold")

my_names <- 
  babynames::babynames %>%
  pull(name) %>%
  unique() %>%
  sample(1000)

df_work_related <- simulate_df_from_colnames(work_related, 1000, vec_of_ids = my_names)
df_blood_tests  <- simulate_df_from_colnames(blood_tests , 1000, vec_of_ids = my_names)
df_physical     <- simulate_df_from_colnames(physical    , 1000, vec_of_ids = my_names)
df_psych_traits <- simulate_df_from_colnames(psych_traits, 1000, vec_of_ids = my_names)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

copy_to(con, df_work_related, name = "DJLNGJN3445_NFKS")
copy_to(con, df_blood_tests , name = "DKFMDGNSQWRE_320586")
copy_to(con, df_physical    , name = "KLDJNSDOIJFW_295868FJDI")
copy_to(con, df_psych_traits, name = "AQNF_223_daqVV")

Created on 2021-10-23 by the reprex package (v2.0.1)


2. My problem starts here; I have a database full of data I want to analyze
After running the simulation code above, we now have the con object representing the remote database. We can explore which tables con contains:

DBI::dbListObjects(con)
#>                                  table is_prefix
#> 1          <Id> table = AQNF_223_daqVV     FALSE
#> 2        <Id> table = DJLNGJN3445_NFKS     FALSE
#> 3     <Id> table = DKFMDGNSQWRE_320586     FALSE
#> 4 <Id> table = KLDJNSDOIJFW_295868FJDI     FALSE
#> 5            <Id> table = sqlite_stat1     FALSE
#> 6            <Id> table = sqlite_stat4     FALSE

The person who granted us access to the database also told us that it stores data about roughly 1,000 people, spread across 4 different tables; a quick sanity check of that claim is sketched right after the table below.

Table name                  Type of data stored there
AQNF_223_daqVV              Psychological measurements
DJLNGJN3445_NFKS            Work-related measurements
DKFMDGNSQWRE_320586         Blood tests
KLDJNSDOIJFW_295868FJDI     Physical measurements, e.g., height, weight, etc.
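
To verify that claim, one option (not part of the original pipeline, just a sanity check) is to count the rows of each table on the database side, without collecting anything but the counts:

db_tables <- c("AQNF_223_daqVV", "DJLNGJN3445_NFKS",
               "DKFMDGNSQWRE_320586", "KLDJNSDOIJFW_295868FJDI")

sapply(db_tables, function(nm) {
  tbl(con, nm) %>% # lazy reference; nothing is downloaded yet
    count() %>%    # translated to SELECT COUNT(*) on the database
    collect() %>%  # bring back only the one-row result
    pull(n)
})

Each of the four simulated tables should report 1,000 rows.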

3. A walkthrough with one new variable as an example
Let's say I want to compute a new variable representing whether a person is fun to be with. From the table above I see that the table named "AQNF_223_daqVV" holds the psychological measurements, so I understand that it is the relevant table. After exploring that data, I decide that my new variable "fun_to_be_with" will be the mean of the existing variables accessible, active, and adaptable.

library(dplyr)

compute_fun_to_be_with <- function(.dat) {
  .dat %>%
    select(id, accessible, active, adaptable) %>%
    mutate(fun_to_be_with = rowMeans(across(c(accessible, active, adaptable))), .keep = "unused")
}
tbl(con, "AQNF_223_daqVV") %>%
  collect() %>%
  compute_fun_to_be_with()
#> # A tibble: 1,000 x 2
#>    id        fun_to_be_with
#>    <chr>              <dbl>
#>  1 Miari               NA  
#>  2 Demariana           NA  
#>  3 Halah               NA  
#>  4 Abdalah             NA  
#>  5 Infiniti            NA  
#>  6 Sydel               63.0
#>  7 Montelle            62.8
#>  8 Rhys                NA  
#>  9 Mijah               73.0
#> 10 Lamontre            NA  
#> # ... with 990 more rows

As I started exploring more interesting variables to compute, I realized that beyond the initial computation (e.g., taking the mean), there are several wrangling steps that apply to some variables but not to others. For example, sometimes I want to round a variable's values up or down, or take the logarithm, or whatever. So each new variable I compute is a "special snowflake", and I have a wrapper function that orchestrates all of these preferences.

compute_snowflake <- function(.dat, snowflake_name) {
  
  switch(snowflake_name,
         "fun_to_be_with" = compute_fun_to_be_with(.dat))
}


wrangle_snowflake <- function(snowflake_name,
                              raw_data_from_db,
                              replace_na_with_zero,
                              take_logarithm,
                              round = c("up", "down"),
                              standardize_as_zscore) {

  round <- match.arg(round) # settle on a single value, "up" or "down"

  raw_data_from_db %>%
    compute_snowflake(snowflake_name) %>%
    {if (replace_na_with_zero) mutate(., across(all_of(snowflake_name), ~ tidyr::replace_na(.x, 0))) else .} %>%
    {if (take_logarithm) mutate(., across(all_of(snowflake_name), log)) else .} %>%
    {if (round == "up") mutate(., across(all_of(snowflake_name), ceiling)) else .} %>%
    {if (round == "down") mutate(., across(all_of(snowflake_name), floor)) else .} %>%
    {if (standardize_as_zscore) mutate(., across(all_of(snowflake_name), scale)) else .}
}

3a) The need for iteration
For wrangling just one new variable, wrangle_snowflake() can be used as-is:

wrangle_snowflake(snowflake_name = "fun_to_be_with", 
                  raw_data_from_db = tbl(con, "AQNF_223_daqVV") %>% collect(),
                  replace_na_with_zero = FALSE,
                  take_logarithm = TRUE,
                  round = "down",
                  standardize_as_zscore = FALSE)

But the problem is one of scale. My project requires wrangling about 100 new variables, and I don't want to repeat the wrangle_snowflake() call 100 times like that. If we were simply storing the DB tables locally as objects in the environment, purrr::pmap() would let us iterate very nicely:

raw_tbl_psych <- 
  tbl(con, "AQNF_223_daqVV") %>% 
  collect()

tbl_parameters <- 
  tibble::tribble(~snowflake_name, ~raw_data_from_db, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
                "fun_to_be_with", raw_tbl_psych, FALSE, TRUE, "down", FALSE)
tbl_parameters
#> # A tibble: 1 x 6
#>   snowflake_name raw_data_from_db      replace_na_with_zero take_logarithm round
#>   <chr>          <list>                <lgl>                <lgl>          <chr>
#> 1 fun_to_be_with <tibble [1,000 x 71]> FALSE                TRUE           down 
#> # ... with 1 more variable: standardize_as_zscore <lgl>

tbl_parameters %>%
  purrr::pmap(.f = wrangle_snowflake)
#> [[1]]
#> # A tibble: 1,000 x 2
#>    id        fun_to_be_with
#>    <chr>              <dbl>
#>  1 Miari                 NA
#>  2 Demariana             NA
#>  3 Halah                 NA
#>  4 Abdalah               NA
#>  5 Infiniti              NA
#>  6 Sydel                  4
#>  7 Montelle               4
#>  8 Rhys                  NA
#>  9 Mijah                  4
#> 10 Lamontre              NA
#> # ... with 990 more rows

pmap() is very powerful here, because I can extend tbl_parameters with more snowflakes while the call tbl_parameters %>% purrr::pmap(.f = wrangle_snowflake) stays exactly the same.
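
For illustration only (the values below are hypothetical, and compute_snowflake() would still need a "work_ethics" branch before this actually runs), scaling up is just a matter of adding rows to the parameter table:

raw_tbl_work <- 
  tbl(con, "DJLNGJN3445_NFKS") %>% 
  collect()

tbl_parameters_extended <- 
  tibble::tribble(~snowflake_name,  ~raw_data_from_db, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
                  "fun_to_be_with", raw_tbl_psych,     FALSE,                 TRUE,            "down", FALSE,
                  "work_ethics",    raw_tbl_work,      TRUE,                  TRUE,            "up",   FALSE)

# The iterating call itself would not change:
# tbl_parameters_extended %>% purrr::pmap(.f = wrangle_snowflake)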

3b) {targets}
One major issue is not reflected in this example: the data I need to fetch from the remote database is huge. Each table (e.g., AQNF_223_daqVV) can run to 1-10 million rows. In such a situation I don't want to load the entire data into the environment as an R object. Instead, the {targets} package lets me create a "target" for each giant table, stored as an .rds file in a directory. That way I can work with a giant table by reference, without keeping it loaded.
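
As a minimal sketch of that setup (not taken from the original question; get_raw_tbl() is a hypothetical helper, and the SQLite connection stands in for the real Oracle connection details):

# _targets.R (sketch)
library(targets)

tar_option_set(packages = c("dplyr", "DBI", "RSQLite"))

# Hypothetical helper: open a connection, pull one table, close the connection.
get_raw_tbl <- function(db_table_name) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # stand-in for the Oracle DSN
  on.exit(DBI::dbDisconnect(con))
  dplyr::collect(dplyr::tbl(con, db_table_name))
}

list(
  # Each giant table becomes its own target; by default it is stored as an .rds
  # file under _targets/objects/ and read back only when a downstream target needs it.
  tar_target(raw_tbl_psych, get_raw_tbl("AQNF_223_daqVV"))
)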

Finally: my problem

{targets} does not play well with my pmap() approach. Since I don't want to bring the entire giant data tables into R's environment, I would rather refer to them simply by their names. That way, my tbl_parameters_2 would look like this:

tbl_parameters_2 <-
  tibble::tribble(
      ~snowflake_name,                  ~db_name, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
     "fun_to_be_with",          "AQNF_223_daqVV",                 FALSE,            TRUE, "down",                  FALSE,
        "work_ethics",        "DJLNGJN3445_NFKS",                  TRUE,            TRUE,   "up",                  FALSE,
                "bmi", "KLDJNSDOIJFW_295868FJDI",                 FALSE,           FALSE,   "up",                   TRUE,
  "risk_for_diabetes",     "DKFMDGNSQWRE_320586",                 FALSE,           FALSE, "down",                  FALSE
  )

But! {targets} does not allow referring to existing targets by a string.

So if I use targets to create a target for each database table:

library(targets)

tar_target(raw_tbl_psych,  tbl(con, "AQNF_223_daqVV") %>% collect())
tar_target(raw_tbl_work,   tbl(con, "DJLNGJN3445_NFKS") %>% collect())
tar_target(raw_tbl_physical, tbl(con, "KLDJNSDOIJFW_295868FJDI") %>% collect())
tar_target(raw_tbl_blood,  tbl(con, "DKFMDGNSQWRE_320586") %>% collect())

and then want to iterate with pmap() over tbl_parameters_2, swapping each db_name string for its corresponding target, it won't work:

swap_table_ugly_name_for_nice_target_name <- function(ugly_name) {
  
  switch(ugly_name,
         # ugly_name               # targets name
         "AQNF_223_daqVV"          = "raw_tbl_psych",
         "DJLNGJN3445_NFKS"        = "raw_tbl_work",
         "KLDJNSDOIJFW_295868FJDI" = "raw_tbl_physical",
         "DKFMDGNSQWRE_320586"     = "raw_tbl_blood"
         )
}
tar_target(list_of_wrangled_snowflakes,
           wrangle_snowflake(snowflake_name        = tbl_parameters_2$snowflake_name,
                             db_name               = swap_table_ugly_name_for_nice_target_name(tbl_parameters_2$db_name),
                             replace_na_with_zero  = tbl_parameters_2$replace_na_with_zero,
                             take_logarithm        = tbl_parameters_2$take_logarithm,
                             round                 = tbl_parameters_2$round,
                             standardize_as_zscore = tbl_parameters_2$standardize_as_zscore)
)

Well, it simply doesn't work. According to @landau, that is because:

targets detects dependencies with static code analysis
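
To make that concrete (an illustration, not from the original question), targets::tar_deps() shows which symbols the static analysis picks up from a command:

# The target's name appears literally in the command, so it is detected:
targets::tar_deps(
  wrangle_snowflake("fun_to_be_with", raw_tbl_psych, FALSE, TRUE, "down", FALSE)
)
# ...the result includes "raw_tbl_psych", so {targets} wires up that dependency.

# Here only a string is visible, so no upstream target is detected:
targets::tar_deps(
  wrangle_snowflake("fun_to_be_with",
                    swap_table_ugly_name_for_nice_target_name("AQNF_223_daqVV"),
                    FALSE, TRUE, "down", FALSE)
)
# ...the result includes the helper function's name, but not "raw_tbl_psych".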


For those of you who have read this far: do you perhaps know how to combine iteration with referencing pre-existing targets?
