I'm using the {targets} package in a data analysis project in which I need to fetch datasets from a remote Oracle database.
My primary task is twofold:
- Fetch data from different views and tables scattered across the Oracle DB.
- Compute and wrangle new data columns based on the data I pull from Oracle.

Although these computed columns come from different database tables, they do have enough in common that I can join them by a few relevant index columns.
Each new column I compute is a "special snowflake", so I built a dedicated wrangling function for each one, plus a higher-order function that calls each snowflake's wrangling sub-function.
My problem concerns how to use `targets::tar_target()` in this workflow.
Reproducible example
To convey my problem accurately, I unfortunately need to include quite a lot of code in this example. The first part only generates demo data to mimic the Oracle DB. You can simply run it and skip ahead to the section that follows.
1. Simulating data: building a database with 4 tables
Just run this code; it isn't important for understanding the problem.
library(dplyr, warn.conflicts = FALSE)
library(babynames)
library(DBI)
library(RSQLite)
set.seed(2021)
# Simulate a data frame of random values (with ~20% NAs) for the given column
# names, plus an `id` column taken from `vec_of_ids`.
simulate_df_from_colnames <- function(vec_of_colnames, desired_nrows, vec_of_ids) {
  stopifnot(desired_nrows == length(vec_of_ids))
  ncols    <- length(vec_of_colnames)
  n_values <- ncols * desired_nrows
  vec <- runif(n = n_values, min = 1, max = 100)
  vec[sample(1:length(vec), 0.2 * length(vec))] <- NA # sprinkle NA randomly in 20% of values
  mat <- matrix(vec, ncol = ncols)
  df  <- as.data.frame(mat)
  colnames(df) <- vec_of_colnames
  df$id <- vec_of_ids
  df <- df[, c(ncol(df), 1:(ncol(df) - 1))] # move the id column from last to first position
  return(df)
}
work_related <- c("acceptance", "accountability", "achievement", "adaptability", "adventure", "authenticity", "authority", "autonomy", "balance", "boldness", "bravery", "candor", "challenge", "clarity", "collaboration", "compassion", "communication", "community", "contribution", "creativity", "curiosity", "dependability", "determination", "diversity", "empathy", "enthusiasm", "equality", "family", "fairness", "flexibility", "friendship", "growth", "happiness", "hard_work", "honesty", "humility", "humor", "impact", "improvement", "ingenuity", "innovation", "kindness", "knowledge", "leadership", "learning", "loyalty", "meaningful_work", "optimism", "ownership", "participation", "patience", "peace", "persistence", "popularity", "power", "quality", "recognition", "relationships", "reliability", "reputation", "respect", "responsibility", "results", "security", "self_improvement", "simplicity", "spirituality", "stability", "success", "sustainability", "teamwork", "tenacity", "time_management", "transparency", "trustworthiness", "wealth", "wisdom", "work_ethic", "work_life_balance")
blood_tests <- c("white_blood_cell_count", "red_blood_cell_count", "hemoglobin", "hematocrit", "mean_corpuscular_volume", "platelet_count", "sodium", "potassium", "chloride", "carbon_dioxide", "blood_urea_nitrogen", "creatinine", "glucose", "calcium", "total_protein", "albumin", "bilirubin", "alkaline_phosphatase", "ast", "alt", "vitamin_b_12", "methylmalonic_acid", "ferritin")
physical <- c("systolic_blood_pressure", "diastolic_blood_pressure", "pulse_rate_beats_minute", "height", "weight", "bmi", "waist_circumference", "hip_circumference")
psych_traits <- c("accessible", "active", "adaptable", "admirable", "adventurous", "agreeable", "alert", "allocentric", "amiable", "anticipative", "appreciative", "articulate", "aspiring", "athletic", "attractive", "balanced", "benevolent", "brilliant", "calm", "capable", "captivating", "caring", "challenging", "charismatic", "charming", "cheerful", "clean", "clear_headed", "clever", "colorful", "companionly", "compassionate", "conciliatory", "confident", "conscientious", "abrasive", "abrupt", "agonizing", "aimless", "airy", "aloof", "amoral", "angry", "anxious", "apathetic", "arbitrary", "argumentative", "arrogantt", "artificial", "asocial", "assertive", "astigmatic", "barbaric", "bewildered", "bizarre", "bland", "blunt", "biosterous", "brittle", "brutal", "calculating", "callous", "cantakerous", "careless", "cautious", "charmless", "childish", "clumsy", "coarse", "cold")
my_names <-
  babynames::babynames %>%
  pull(name) %>%
  unique() %>%
  sample(1000)
df_work_related <- simulate_df_from_colnames(work_related, 1000, vec_of_ids = my_names)
df_blood_tests <- simulate_df_from_colnames(blood_tests , 1000, vec_of_ids = my_names)
df_physical <- simulate_df_from_colnames(physical , 1000, vec_of_ids = my_names)
df_psych_traits <- simulate_df_from_colnames(psych_traits, 1000, vec_of_ids = my_names)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, df_work_related, name = "DJLNGJN3445_NFKS")
copy_to(con, df_blood_tests , name = "DKFMDGNSQWRE_320586")
copy_to(con, df_physical , name = "KLDJNSDOIJFW_295868FJDI")
copy_to(con, df_psych_traits, name = "AQNF_223_daqVV")
Created on 2021-10-23 by the reprex package (v2.0.1)
2. My problem starts here: I have a database full of data I want to analyze
After running the simulation code above, we now have a `con` object that represents the remote database. We can explore which tables `con` contains:
DBI::dbListObjects(con)
#> table is_prefix
#> 1 <Id> table = AQNF_223_daqVV FALSE
#> 2 <Id> table = DJLNGJN3445_NFKS FALSE
#> 3 <Id> table = DKFMDGNSQWRE_320586 FALSE
#> 4 <Id> table = KLDJNSDOIJFW_295868FJDI FALSE
#> 5 <Id> table = sqlite_stat1 FALSE
#> 6 <Id> table = sqlite_stat4 FALSE
Whoever granted us access to the database also told us that it stores data on about 1,000 people, spread across 4 different tables.

| Table name | Kind of data stored there |
|---|---|
| AQNF_223_daqVV | Psychological measures |
| DJLNGJN3445_NFKS | Employment-related measures |
| DKFMDGNSQWRE_320586 | Blood tests |
| KLDJNSDOIJFW_295868FJDI | Physical measures, e.g. height, weight, etc. |
3. Walkthrough of one new variable as an example
Let's say I want to compute a new variable that reflects whether a person is fun to be with. From the table above I see that the table named "AQNF_223_daqVV" contains psychological measures, so I understand it is the relevant table. After exploring that data, I decide that my new variable `fun_to_be_with` will be the mean of the existing variables `accessible`, `active`, and `adaptable`.
library(dplyr)

compute_fun_to_be_with <- function(.dat) {
  .dat %>%
    select(id, accessible, active, adaptable) %>%
    mutate(fun_to_be_with = rowMeans(across(c(accessible, active, adaptable))), .keep = "unused")
}

tbl(con, "AQNF_223_daqVV") %>%
  collect() %>%
  compute_fun_to_be_with()
#> # A tibble: 1,000 x 2
#> id fun_to_be_with
#> <chr> <dbl>
#> 1 Miari NA
#> 2 Demariana NA
#> 3 Halah NA
#> 4 Abdalah NA
#> 5 Infiniti NA
#> 6 Sydel 63.0
#> 7 Montelle 62.8
#> 8 Rhys NA
#> 9 Mijah 73.0
#> 10 Lamontre NA
#> # ... with 990 more rows
As I started exploring more interesting variables to compute, I realized that, beyond the initial computation (e.g., taking the mean), there are several wrangling steps that apply to some variables but not to others. For example, sometimes I want to round a variable's values up or down, or take the log, and so on. So each new variable I compute is a "special snowflake", and I have a wrapper function that orchestrates all of these preferences.
compute_snowflake <- function(.dat, snowflake_name) {
  switch(snowflake_name,
         "fun_to_be_with" = compute_fun_to_be_with(.dat))
}

wrangle_snowflake <- function(snowflake_name,
                              raw_data_from_db,
                              replace_na_with_zero,
                              take_logarithm,
                              round = c("up", "down"),
                              standardize_as_zscore) {
  raw_data_from_db %>%
    compute_snowflake(snowflake_name) %>%
    {if (replace_na_with_zero) mutate(., across({{ snowflake_name }}, tidyr::replace_na, 0)) else .} %>%
    {if (take_logarithm) mutate(., across(fun_to_be_with, log)) else .} %>%
    {if (round == "up") mutate(., across(fun_to_be_with, ceiling)) else .} %>%
    {if (round == "down") mutate(., across(fun_to_be_with, floor)) else .} %>%
    {if (standardize_as_zscore) mutate(., across(fun_to_be_with, scale)) else .}
}
3a) The need for iteration
For wrangling just one new variable, `wrangle_snowflake()` can be used as is:
wrangle_snowflake(snowflake_name = "fun_to_be_with",
                  raw_data_from_db = tbl(con, "AQNF_223_daqVV") %>% collect(),
                  replace_na_with_zero = FALSE,
                  take_logarithm = TRUE,
                  round = "down",
                  standardize_as_zscore = FALSE)
But the problem is scale. My project requires wrangling about 100 new variables, and I don't want to repeat the call to `wrangle_snowflake()` 100 times. If we simply saved the DB tables locally as objects in the environment, we could iterate nicely with `purrr::pmap()`:
raw_tbl_psych <-
  tbl(con, "AQNF_223_daqVV") %>%
  collect()

tbl_parameters <-
  tibble::tribble(~snowflake_name,  ~raw_data_from_db, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
                  "fun_to_be_with", raw_tbl_psych,     FALSE,                 TRUE,            "down", FALSE)
tbl_parameters
#> # A tibble: 1 x 6
#> snowflake_name raw_data_from_db replace_na_with_zero take_logarithm round
#> <chr> <list> <lgl> <lgl> <chr>
#> 1 fun_to_be_with <tibble [1,000 x 71]> FALSE TRUE down
#> # ... with 1 more variable: standardize_as_zscore <lgl>
tbl_parameters %>%
  purrr::pmap(.f = wrangle_snowflake)
#> [[1]]
#> # A tibble: 1,000 x 2
#> id fun_to_be_with
#> <chr> <dbl>
#> 1 Miari NA
#> 2 Demariana NA
#> 3 Halah NA
#> 4 Abdalah NA
#> 5 Infiniti NA
#> 6 Sydel 4
#> 7 Montelle 4
#> 8 Rhys NA
#> 9 Mijah 4
#> 10 Lamontre NA
#> # ... with 990 more rows
`pmap()` is very powerful here, because I can scale up `tbl_parameters` with more snowflakes while the call `tbl_parameters %>% purrr::pmap(.f = wrangle_snowflake)` stays exactly the same. For instance, a second snowflake could be appended as shown below.
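A minimal sketch of that scaling, purely for illustration: it assumes a `raw_tbl_work` tibble collected from "DJLNGJN3445_NFKS" the same way as `raw_tbl_psych`, and a hypothetical `compute_work_ethics()` registered inside `compute_snowflake()`.

# hypothetical second row; the pmap() call itself does not change
tbl_parameters_expanded <-
  tbl_parameters %>%
  tibble::add_row(snowflake_name        = "work_ethics",
                  raw_data_from_db      = list(raw_tbl_work), # assumed to be collected like raw_tbl_psych
                  replace_na_with_zero  = TRUE,
                  take_logarithm        = TRUE,
                  round                 = "up",
                  standardize_as_zscore = FALSE)

tbl_parameters_expanded %>%
  purrr::pmap(.f = wrangle_snowflake)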
3b) {targets}
One major issue is not reflected in this example: the data I need to fetch from the remote DB is huge. Each table (e.g., AQNF_223_daqVV) can run to 1-10 million rows. In such a case I don't want to load the entire table into the environment as an R object. Instead, the {targets} package lets me create a "target" for each giant table, stored as an .rds file in a dedicated directory. That way I can refer to a giant table without having to load it, as sketched below.
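For context, here is a minimal sketch of what that looks like in a `_targets.R` file; it assumes the database connection `con` is created near the top of that file (not shown), which is a simplification of however the real Oracle connection would be configured.

# _targets.R (sketch)
library(targets)
tar_option_set(packages = c("dplyr", "DBI"))

# con <- DBI::dbConnect(...)   # assumed; the real Oracle connection goes here

list(
  # each giant DB table becomes its own target; {targets} caches the collected
  # data on disk (by default as .rds files under _targets/objects/)
  tar_target(raw_tbl_psych, tbl(con, "AQNF_223_daqVV") %>% collect()),
  tar_target(raw_tbl_blood, tbl(con, "DKFMDGNSQWRE_320586") %>% collect())
)

Downstream targets then refer to `raw_tbl_psych` by its symbol, and `tar_make()` only re-collects a table when its definition changes.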
Finally: my problem
{targets} doesn't play nicely with my `pmap()` approach. Since I don't want to bring the entire giant data tables into R's environment, I would rather refer to them simply by name. That way, my `tbl_parameters_2` would look like:
tbl_parameters_2 <-
  tibble::tribble(
    ~snowflake_name,     ~db_name,                  ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
    "fun_to_be_with",    "AQNF_223_daqVV",          FALSE,                 TRUE,            "down", FALSE,
    "work_ethics",       "DJLNGJN3445_NFKS",        TRUE,                  TRUE,            "up",   FALSE,
    "bmi",               "KLDJNSDOIJFW_295868FJDI", FALSE,                 FALSE,           "up",   TRUE,
    "risk_for_diabetes", "DKFMDGNSQWRE_320586",     FALSE,                 FALSE,           "down", FALSE
  )
But! {targets} does not allow referring to an existing target by a string.
So if I use `targets` to create one target per database table:
library(targets)
tar_target(raw_tbl_psych, tbl(con, "AQNF_223_daqVV") %>% collect())
tar_target(raw_tbl_work, tbl(con, "DJLNGJN3445_NFKS") %>% collect())
tar_target(raw_tbl_physical, tbl(con, "KLDJNSDOIJFW_295868FJDI") %>% collect())
tar_target(raw_tbl_blood, tbl(con, "DKFMDGNSQWRE_320586") %>% collect())
and then want to use `pmap()` to iterate over `tbl_parameters_2`, swapping each `db_name` string for its corresponding target, it won't work.
swap_table_ugly_name_for_nice_target_name <- function(ugly_name) {
  switch(ugly_name,
         # ugly_name                 # targets name
         "AQNF_223_daqVV"          = "raw_tbl_psych",
         "DJLNGJN3445_NFKS"        = "raw_tbl_work",
         "KLDJNSDOIJFW_295868FJDI" = "raw_tbl_physical",
         "DKFMDGNSQWRE_320586"     = "raw_tbl_blood"
  )
}
tar_target(list_of_wrangled_snowflakes,
           wrangle_snowflake(snowflake_name = tbl_parameters_2$snowflake_name,
                             db_name = swap_table_ugly_name_for_nice_target_name(tbl_parameters_2$db_name),
                             replace_na_with_zero = tbl_parameters_2$replace_na_with_zero,
                             take_logarithm = tbl_parameters_2$take_logarithm,
                             round = tbl_parameters_2$round,
                             standardize_as_zscore = tbl_parameters_2$standardize_as_zscore)
)
Well, it simply doesn't work. According to @landau, that's because:

> targets detects dependencies with static code analysis
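To illustrate what that means (my own sketch, assuming `targets::tar_deps()` reports the result of that static analysis): when a target's command mentions `raw_tbl_psych` as a bare symbol, the analysis picks it up as an upstream dependency; when the table is only named by a string, nothing is picked up.

library(targets)

# the symbol raw_tbl_psych should show up among the detected dependencies
tar_deps(wrangle_snowflake("fun_to_be_with", raw_data_from_db = raw_tbl_psych))

# only function names should show up here; the string "AQNF_223_daqVV" (and the
# target name it maps to) is invisible to the static analysis
tar_deps(wrangle_snowflake("fun_to_be_with",
                           db_name = swap_table_ugly_name_for_nice_target_name("AQNF_223_daqVV")))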
For those who have read this far: perhaps you know how to combine iteration with references to pre-existing targets?