I'm having trouble natural joining sqlite tibbles (tbl_sql objects) that share same-named columns containing NA values (or missing values, I suppose).
library(DBI)
library(dplyr)
library(dbplyr)
## modify mtcars for example
modcars <- mtcars
modcars[["NAs"]] <- c(rep(1, 3), rep(NA, 29))
## store modcars in sql table and get it
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "modcars", modcars)
srcdbi_mydb <- src_dbi(mydb)
tbl_modcars <- tbl(srcdbi_mydb, "modcars")
modcars %>% head
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
tbl_modcars %>% head
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
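Note that inner joining each of these two tables with itself gives different output: the data.frame join keeps all rows, while the sqlite join keeps only the rows where NAs is not missing. This is due to the difference in how dplyr and sqlite handle missing values.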
inner_join(modcars, modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
inner_join(tbl_modcars, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
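A minimal sketch of where the dropped rows likely come from, assuming the join on the sqlite side boils down to a plain equality comparison on each shared column (I have not inspected the generated SQL here, so treat that as an assumption). In SQL, comparing NULL to NULL with = does not yield true, whereas sqlite's IS operator treats two NULLs as equal. The connection name con is only for illustration and is separate from mydb above.

```r
library(DBI)

# Throwaway in-memory SQLite database, independent of `mydb`
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# `NULL = NULL` evaluates to NULL (comes back to R as NA), so rows compared this way never match;
# `NULL IS NULL` evaluates to 1, i.e. the two NULLs are treated as equal
null_cmp <- dbGetQuery(con, "SELECT NULL = NULL AS eq, NULL IS NULL AS is_same")
print(null_cmp)

dbDisconnect(con)
```

By contrast, the data.frame method of inner_join() matches NA to NA by default, which would explain why the rows with a missing NAs value survive only in the first join.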
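Effectively, I would like calling inner_join() on the tibble tbl_modcars to return the same rows as calling inner_join() on the data.frame modcars.
I realize that I could simply use the following code to get the desired output: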
joinee1 <- tbl_modcars %>% select(setdiff(colnames(tbl_modcars), "NAs"))
inner_join(joinee1, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
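However, this ignores any non-NA information in the NAs column for the join (where applicable). Furthermore, I would rather make just one dplyr call instead of two (parser stack overflows can become an issue with too many calls).
Any solutions or clarifications are appreciated.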