
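I'm having trouble getting dplyr joins to work when the join condition is not a standard "col1" = "col2" equality. Here are two examples of what I'm running into.

First: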

library(dplyr)

tableA <- data.frame(col1= c("a","b","c","d"),
                     col2 = c(1,2,3,4))

inner_join(tableA, tableA, by = c("col1"!="col1")) %>% 
  select(col1, col2.x) %>% 
  arrange(col1, col2.x)

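Error: `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not logical

When I replicate this code in SQL, I get the following: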

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

copy_to(con, tableA)

tbl(con, sql("select a.col1, b.col2
              from 
              tableA as a
              inner join 
              tableA as b
              on a.col1 <> b.col1")) %>% 
  arrange(col1, col2)

Result of the SQL query:

# Source:     SQL [?? x 2]
# Database:   sqlite 3.19.3 [:memory:]
# Ordered by: col1, col2
     col1  col2
     <chr> <dbl>
 1     a     2
 2     a     3
 3     a     4
 4     b     1
 5     b     3
 6     b     4
 7     c     1
 8     c     2
 9     c     4
10     d     1
# ... with more rows

The second case is similar to the first:

inner_join(tableA, tableA, by = c("col1" > "col1")) %>% 
   select(col1, col2.x) %>% 
   arrange(col1, col2.x)

Error: `by` must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not logical

SQL equivalent:

tbl(con, sql("select a.col1, b.col2
              from tableA as a
              inner join tableA as b
              on a.col1 > b.col1")) %>% 
   arrange(col1, col2)

Result of the second SQL query:

# Source:     SQL [?? x 2]
# Database:   sqlite 3.19.3 [:memory:]
# Ordered by: col1, col2
   col1  col2
  <chr> <dbl>
1     b     1
2     c     1
3     c     2
4     d     1
5     d     2
6     d     3

Does anyone know how to reproduce these SQL examples using dplyr code?


2 Answers


For your first case:

library(dplyr)
library(tidyr)

expand(tableA, col1, col2) %>% 
  left_join(tableA, by = 'col1') %>% 
  filter(col2.x != col2.y) %>% 
  select(col1, col2 = col2.x)

Result:

# A tibble: 12 x 2
     col1  col2
   <fctr> <dbl>
 1      a     2
 2      a     3
 3      a     4
 4      b     1
 5      b     3
 6      b     4
 7      c     1
 8      c     2
 9      c     4
10      d     1
11      d     2
12      d     3

For your second case:

expand(tableA, col1, col2) %>% 
  left_join(tableA, by = 'col1') %>% 
  filter(col2.x < col2.y) %>% 
  select(col1, col2 = col2.x)

Result:

# A tibble: 6 x 2
    col1  col2
  <fctr> <dbl>
1      b     1
2      c     1
3      c     2
4      d     1
5      d     2
6      d     3
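
Both cases follow the same pattern: build every pairing first, then filter on the non-equi condition. As a rough sketch of that pattern (not part of the original answer), the pairs can also be built with an explicit cross join of tableA with itself, which keeps both col1 columns so the SQL `on` clause can be written almost verbatim. The names `by_arg`, `pairs`, `not_equal` and `greater` are made up for illustration, and `stringsAsFactors = FALSE` is set on the assumption that col1 should be compared as plain character strings. The first line also shows why the original attempt fails: R evaluates the comparison before `inner_join()` ever sees it.

```r
library(dplyr)

tableA <- data.frame(col1 = c("a", "b", "c", "d"),
                     col2 = c(1, 2, 3, 4),
                     stringsAsFactors = FALSE)

# The comparison is evaluated eagerly, so `by` receives a logical, not a column spec.
by_arg <- c("col1" != "col1")   # FALSE

# Cartesian product of tableA with itself (base R). merge() suffixes the
# clashing names, giving col1.x, col2.x, col1.y, col2.y.
pairs <- merge(tableA, tableA, by = NULL)

# First case: on a.col1 <> b.col1
not_equal <- pairs %>%
  filter(col1.x != col1.y) %>%
  select(col1 = col1.x, col2 = col2.y) %>%
  arrange(col1, col2)

# Second case: on a.col1 > b.col1
greater <- pairs %>%
  filter(col1.x > col1.y) %>%
  select(col1 = col1.x, col2 = col2.y) %>%
  arrange(col1, col2)
```

The cross join materialises n x n rows before filtering, so this is only reasonable for small tables.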
answered 2017-11-24T19:25:48.797

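A solution using dplyr and tidyr that reproduces the second query (a.col1 > b.col1). The idea is to expand the data frame with complete and then join it back to the original data frame. After that, fill from tidyr with .direction = "up" copies each matched value into the NA rows above it within its col1 group. Finally, filter out the records where the two values are equal and the records that are still NA.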

library(dplyr)
library(tidyr)

tableB <- tableA %>%
  complete(col1, col2) %>%
  left_join(tableA %>% mutate(col3 = col2), by = c("col1", "col2")) %>%
  group_by(col1) %>%
  fill(col3, .direction = "up") %>%
  filter(col2 != col3, !is.na(col3)) %>%
  select(-col3) %>%
  ungroup()
tableB
# # A tibble: 6 x 2
#    col1  col2
#   <chr> <dbl>
# 1     b     1
# 2     c     1
# 3     c     2
# 4     d     1
# 5     d     2
# 6     d     3

Data

tableA <- data.frame(col1= c("a","b","c","d"),
                     col2 = c(1,2,3,4), stringsAsFactors = FALSE)
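
For readers on a newer dplyr, here is a hedged sketch of the second case written as a direct inequality join. It assumes dplyr 1.1.0 or later, where `join_by()` accepts `>`, `>=`, `<` and `<=` conditions; it was not available when this answer was written, and as far as I know `!=` is not among the supported conditions, so the first case still needs an expand-and-filter approach. The name `result` is only illustrative.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which provides join_by()

tableA <- data.frame(col1 = c("a", "b", "c", "d"),
                     col2 = c(1, 2, 3, 4),
                     stringsAsFactors = FALSE)

# Left side of the condition refers to the first table, right side to the second,
# mirroring `on a.col1 > b.col1`. Inequality joins keep the keys from both
# inputs, so the output has col1.x, col2.x, col1.y and col2.y.
result <- inner_join(tableA, tableA, by = join_by(col1 > col1)) %>%
  select(col1 = col1.x, col2 = col2.y) %>%
  arrange(col1, col2)

result
```

Unlike a cross join followed by a filter, this leaves the matching to the join itself, which should scale better on larger tables.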
answered 2017-11-24T18:31:21.923