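I ran into a problem when doing natural joins on sqlite tibbles (tbl_sql objects) that share identically named columns containing NA values (or missing values, I suppose).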

library(DBI)
library(dplyr)
library(dbplyr)

## modify mtcars for example
modcars <- mtcars
modcars[["NAs"]] <- c(rep(1, 3), rep(NA, 29))

## store modcars in sql table and get it
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "modcars", modcars)
srcdbi_mydb <- src_dbi(mydb)
tbl_modcars <- tbl(srcdbi_mydb, "modcars")

modcars %>% head
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb NAs
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   1
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   1
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  NA
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  NA
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  NA

tbl_modcars %>% head
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1
#> 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1    NA
#> 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2    NA
#> 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1    NA

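Note that inner-joining each of these two tables with itself gives different output. This is because dplyr and sqlite handle missing values differently: dplyr treats two NA values as a match when joining data frames, whereas in sqlite the comparison NULL = NULL is not true, so rows with NA in a join column are dropped.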
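To make the difference concrete, here is a minimal sketch that is separate from the example above. The connection `con` and the one-column data frame `local_df` are made up for illustration; it only uses an in-memory SQLite database and assumes dplyr's default NA matching for local data frames.

```r
library(DBI)

# Throwaway in-memory database, used only for this illustration
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# In SQL, comparing NULL with `=` yields NULL (unknown), not TRUE,
# so a join condition written with `=` never holds for NULL keys.
# `IS NULL` is the test that does return true (1).
null_cmp <- dbGetQuery(con, "SELECT NULL = NULL AS eq, NULL IS NULL AS is_null")
print(null_cmp)

# For local data frames, dplyr matches NA with NA by default,
# so the row whose key is NA survives a self-join.
local_df <- data.frame(k = c(1, NA))
local_join <- dplyr::inner_join(local_df, local_df, by = "k")
print(nrow(local_join))

dbDisconnect(con)
```

The `eq` column comes back as NA in R, which is why the 29 rows with a missing `NAs` value drop out of the database join.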

inner_join(modcars, modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb NAs
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   1
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   1
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  NA
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  NA
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  NA

inner_join(tbl_modcars, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1

What I actually want is for inner_join() on the tbl_modcars tibbles to return the same rows as inner_join() on the modcars data.frames.

I realize I can get the desired output simply by using the following code:

joinee1 <- tbl_modcars %>% select(setdiff(colnames(tbl_modcars), "NAs"))
inner_join(joinee1, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1
#> 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1    NA
#> 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2    NA
#> 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1    NA

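However, this leaves the NAs column out of the join entirely, so any non-NA information in it is ignored (where that matters). Also, I would rather make one dplyr call than two (with too many calls, parser stack overflow can become a problem).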
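One direction that might do this in a single call is sketched below. It assumes a dbplyr version whose join methods for lazy tables accept an `na_matches` argument; I believe newer releases do, but that is an assumption and it has not been checked against the versions shown in the output above. The names `con`, `joined` and `local_joined` are only for this sketch.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Same modified data as in the example above
modcars <- mtcars
modcars[["NAs"]] <- c(rep(1, 3), rep(NA, 29))

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "modcars", modcars)
tbl_modcars <- tbl(con, "modcars")

# ASSUMPTION: this dbplyr version supports na_matches on lazy-table joins.
# na_matches = "na" asks for NA/NULL keys to match each other,
# as dplyr does for local data frames.
joined <- inner_join(tbl_modcars, tbl_modcars,
                     by = colnames(modcars), na_matches = "na") %>%
  collect()

# Reference result from the purely local join
local_joined <- inner_join(modcars, modcars, by = colnames(modcars))

print(nrow(joined))
print(nrow(local_joined))

dbDisconnect(con)
```

If the argument is available, calling show_query() on the lazy join before collect() should show which comparison operator is generated for the join condition.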

Any solutions or clarifications are appreciated.
