1

我有如下三个数据框:

df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','B','C','E','I'),col2=c(4,6,8,2,9))
df1 <- data.frame(col1=c('A','D','C','E','I'),col2=c(4,7,8,2,9))

任何两个文件之间的差异可能如下:

anti_join(df2, df3)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    B    6
# 2    I    9

anti_join(df3, df2)
# Joining, by = c("col1", "col2")
# [1] col1 col2
# <0 rows> (or 0-length row.names)

anti_join(df1, df2)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    D    7

anti_join(df2, df1)
# Joining, by = c("col1", "col2")
#   col1 col2
# 1    B    6

我想创建一个主数据框,其中包含每个数据框中的所有值col1 并且col2特定于每个数据框。如果不存在这样的值,它应该填充NA.

  col1 df1_col2 df2_col2 df3_col2
1    A        4        4        4 
2    B       NA        6       NA  
3    C        8        8        8
4    E        2        2        2 
5    I        9        9       NA
6    D        7       NA       NA

上述输出的本质可以从上述anti_join命令建立。但是,它并没有立即提供完整的图片。关于如何实现这一目标的任何想法?

编辑:对于 for 中的多个值col2col1输出有点混乱。例如,A有值4, 3

df3 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df2 <- data.frame(col1=c('A','A','B','C','E','I'),col2=c(4,3,6,8,2,9))
df1 <- data.frame(col1=c('A','A','D','C','E','I'),col2=c(4,3,7,8,2,9))

lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
lst_of_frames %>%
  imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
  reduce(full_join, by = "col1")

它给出了以下输出。

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        4        3        4
# 3    A        3        4        4
# 4    A        3        3        4
# 5    D        7       NA       NA
# 6    C        8        8        8
# 7    E        2        2        2
# 8    I        9        9       NA
# 9    B       NA        6       NA

输出中有趣的部分是:

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        4        3        4
# 3    A        3        4        4
# 4    A        3        3        4

而预期的输出是:

#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    A        3        3       NA
4

2 回答 2

4

您可以使用包中的full_join功能dplyr

df_master <- df1 %>% 
  full_join(df2, by = "col1") %>% 
  full_join(df3, by = "col1") %>% 
  select(col1, df1_col2 = col2.x, 
         df2_col2 = col2.y,
         df3_col2 = col2)

  col1 df1_col2 df2_col2 df3_col2
1    A        4        4        4
2    D        7       NA       NA
3    C        8        8        8
4    E        2        2        2
5    I        9        9       NA
6    B       NA        6       NA
于 2020-08-28T16:34:51.803 回答
1

类似于@tamtam 的答案,但如果您有动态的帧列表,则有点程序化。

lst_of_frames <- list(df1 = df1, df2 = df2, df3 = df3)
# lst_of_frames <- tibble::lst(df1, df2, df3)    # thanks, @user63230
library(dplyr)
library(purrr)  # imap, reduce
lst_of_frames %>%
  imap(~ rename_at(.x, -1, function(z) paste(.y, z, sep = "_"))) %>%
  reduce(full_join, by = "col1")
#   col1 df1_col2 df2_col2 df3_col2
# 1    A        4        4        4
# 2    D        7       NA       NA
# 3    C        8        8        8
# 4    E        2        2        2
# 5    I        9        9       NA
# 6    B       NA        6       NA

框架列表是命名列表很重要(对于自动重命名列);我的假设是框架变量的名称list(df1=df1),但它可以很容易地list(A=df1)生成A_col2最后命名的列。

于 2020-08-28T16:53:58.703 回答