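So I'm starting to dabble in the wonderful world of programming with dplyr. I'm trying to write a function that accepts a data.frame, a target column, and any number of grouping columns (using bare names for all columns). The function then bins the data by the target column and counts the number of entries in each bin. I want to keep a separate set of bin counts for every combination of the grouping variables that is present in the original data.frame, so I'm using the complete() and nesting() functions to do this. Here is an example of what I'm trying to do and the error I'm running into: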


#Load the packages used below
library(dplyr)
library(tidyr)

#Prepare test data
test_data =
    data.frame(Gene_ID = rep(paste0("Gene.", 1:10), times=4),
               Comparison = rep(c("WT_vs_Mut1", "WT_vs_Mut2"), each=10, times=2),
               Test_method = rep(c("T-test", "MannWhitney"), each=20),
               P_value = runif(40))

#Perform operation manually
test_data %>% 
    #Start by binning the data according to p-value
    mutate(Probability.bin = cut(P_value,
                                 breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by=0.1)),
                                 right = FALSE)) %>% 
    #Now summarize the results by bin.
    count(Comparison, Test_method, Probability.bin) %>% 
    #Fill in any missing bins with 0 counts
    complete(nesting(Comparison, Test_method), Probability.bin,
             fill=list(n = 0))

#Create function that accepts bare column names
bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns

    #"Quote" column names so they are ready for use below
    pvalue_col_name <- enquo(pvalue_col)
    group_by_cols <- quos(...)

    #Perform the operation
    df %>% 
        #Start by binning the data according to p-value
        mutate(Probability.bin = cut(UQ(pvalue_col_name),
                                     breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                     labels = c(seq(0.0, 1.0, by=0.1)),
                                     right = FALSE)) %>% 
        #Now summarize the results by bin.
        count(UQS(group_by_cols), Probability.bin) %>% 
        #Fill in any missing bins with 0 counts
        complete(nesting(UQS(group_by_cols)), Probability.bin,
                 fill=list(n = 0))
}

#Use function to perform operation
test_data %>% 
    bin_by_p_value(P_value, Comparison, Test_method)




complete(nesting(UQS(group_by_cols)), Probability.bin...

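If I remove the call to nesting(), the code runs without error. However, I want to keep the behavior of using only the grouping-variable combinations that exist in the original data and then crossing them with every possible bin, so that I can fill in all the missing bins. Based on the name of the error and where it fails, my guess is that this is a scoping/environment issue, and that I really should be using a different environment for the grouping variables inside nesting(), since it is wrapped in a call to complete(). But I'm new to programming with dplyr and I don't know how to do that.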
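To make the desired behavior concrete, here is a minimal sketch of the difference between complete() with and without nesting(). The data frame `toy` and its columns `g1`, `g2`, and `bin` are hypothetical and exist only for illustration:

```r
library(tidyr)

# Hypothetical toy data: only two of the four possible g1/g2 pairs occur
toy <- data.frame(g1  = c("A", "B"),
                  g2  = c("x", "y"),
                  bin = factor(c("lo", "hi"), levels = c("lo", "hi")),
                  n   = c(3, 5))

# nesting() keeps only the observed g1/g2 pairs, then crosses them with bin
nested <- complete(toy, nesting(g1, g2), bin, fill = list(n = 0))

# Without nesting(), every g1 x g2 x bin combination is generated
crossed <- complete(toy, g1, g2, bin, fill = list(n = 0))

nrow(nested)   # 2 observed pairs * 2 bins
nrow(crossed)  # 2 * 2 * 2 combinations
```

With nesting(), the pairs (A, y) and (B, x), which never occur in `toy`, are not introduced; only the missing bins within the observed pairs are filled with 0.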

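I tried to work around this by merging the grouping columns into a single column and then using that merged column as the input to complete(). This lets me perform the complete() operation the way I want while avoiding nesting(). However, I ran into trouble when I wanted to separate it back into the original grouping columns, because I don't know how to convert a list of quosures into a character vector (which is required for the "into" argument of separate()). Here is a code snippet illustrating what I mean: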

        #Fill in any missing bins with 0 counts
        unite(Merged_grouping_cols, UQS(group_by_cols), sep="*") %>% 
        complete(Merged_grouping_cols, Probability.bin,
                 fill=list(n = 0)) %>%
        separate(Merged_grouping_cols, into=c("What goes here?"), sep="\\*")

Here is the relevant version information: R version 3.4.2 (2017-09-28), tidyr_0.7.2, dplyr_0.7.4

I'd appreciate any workaround, but I'd also like to know what I'm doing that rubs complete() and nesting() the wrong way.


1 Answer

  • Use curly-curly ({{ }}) for pvalue_col.
  • Pass the dots (...) directly to count().
  • Use ensyms() together with !!! inside nesting().
bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns
  #Perform the operation
  df %>% 
    #Start by binning the data according to p-value
    mutate(Probability.bin = cut({{pvalue_col}},
                                 breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by=0.1)),
                                 right = FALSE)) %>% 
    #Now summarize the results by bin.
    count(..., Probability.bin) %>% 
    #Fill in any missing bins with 0 counts
    complete(nesting(!!!ensyms(...)), Probability.bin, fill=list(n = 0))
}

test_data %>% bin_by_p_value(P_value, Comparison, Test_method)

# A tibble: 44 x 4
#   Comparison Test_method Probability.bin     n
#   <chr>      <chr>       <fct>           <dbl>
# 1 WT_vs_Mut1 MannWhitney 0                   1
# 2 WT_vs_Mut1 MannWhitney 0.1                 1
# 3 WT_vs_Mut1 MannWhitney 0.2                 0
# 4 WT_vs_Mut1 MannWhitney 0.3                 1
# 5 WT_vs_Mut1 MannWhitney 0.4                 1
# 6 WT_vs_Mut1 MannWhitney 0.5                 1
# 7 WT_vs_Mut1 MannWhitney 0.6                 0
# 8 WT_vs_Mut1 MannWhitney 0.7                 0
# 9 WT_vs_Mut1 MannWhitney 0.8                 1
#10 WT_vs_Mut1 MannWhitney 0.9                 4
# … with 34 more rows


# Assuming `res` holds the output of the manual pipeline from the question
identical(res, test_data %>% bin_by_p_value(P_value, Comparison, Test_method))
#[1] TRUE
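
For the unite()/separate() workaround mentioned in the question, one possible way to get the character vector needed for `into` is to capture the dots as symbols and convert each one to a string. This is only a sketch: the function name `bin_by_p_value_united` and the variable `group_names` are made up for illustration, and it assumes a recent dplyr/tidyr/rlang (for `{{ }}`) and that none of the grouping values contain the `*` separator.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of bin_by_p_value() that avoids nesting()
bin_by_p_value_united <- function(df, pvalue_col, ...) {
  # Grouping column names as a plain character vector, e.g. c("Comparison", "Test_method")
  group_names <- unname(vapply(rlang::ensyms(...), rlang::as_string, character(1)))

  df %>%
    # Bin the data according to p-value
    mutate(Probability.bin = cut({{ pvalue_col }},
                                 breaks = c(-Inf, seq(0.1, 1, by = 0.1), Inf),
                                 labels = seq(0.0, 1.0, by = 0.1),
                                 right = FALSE)) %>%
    # Count the entries per group and bin
    count(..., Probability.bin) %>%
    # Merge the grouping columns, fill in missing bins, then split them again
    unite(Merged_grouping_cols, ..., sep = "*") %>%
    complete(Merged_grouping_cols, Probability.bin, fill = list(n = 0)) %>%
    separate(Merged_grouping_cols, into = group_names, sep = "\\*")
}

# Same structure as the test data in the question
test_data <- data.frame(Gene_ID = rep(paste0("Gene.", 1:10), times = 4),
                        Comparison = rep(c("WT_vs_Mut1", "WT_vs_Mut2"), each = 10, times = 2),
                        Test_method = rep(c("T-test", "MannWhitney"), each = 20),
                        P_value = runif(40))

out <- bin_by_p_value_united(test_data, P_value, Comparison, Test_method)
```

Note that separate() returns the restored grouping columns as character columns, so any factor or numeric grouping column would need to be converted back afterwards.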
Answered 2021-06-02T04:11:32.137