18

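I'm trying to solve a tricky R problem that I haven't been able to crack by Googling keywords. Specifically, I'm trying to take the subset of one data frame whose values don't appear in another data frame. Here is an example: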

> test
      number    fruit     ID1  ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
> test2
      number    fruit     ID1   ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"

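I have two data frames, test and test2, and the goal is to select every whole row in test2 that does not appear in test, even though some of the individual values may be the same.

The output I want looks like: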

item1 "number1" "papayas" "22"  "33"
item2 "number3" "peaches" "441" "25"
item3 "number4" "apples"  "123" "13"

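There could be an arbitrary number of rows or columns, but in my specific case one data frame is a direct subset of the other.

I've used R's subset(), merge(), and which() functions extensively, but I can't figure out how to combine them, if that's even possible, to get what I want.

Edit: here is the R code I used to generate the two tables.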

# Each c(...) mixes strings and numbers, so every value is coerced to character
test <- data.frame(c("number1", "apples", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))

# t() returns a character matrix, not a data frame
test <- t(test)
rownames(test) <- c("item1", "item2", "item3", "item4")
colnames(test) <- c("number", "fruit", "ID1", "ID2")

test2 <- data.frame(c("number1", "papayas", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 441, 25), c("number4", "apples", 123, 13),
    c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))

test2 <- t(test2)
rownames(test2) <- c("item1", "item2", "item3", "item4", "item5", "item6")
colnames(test2) <- c("number", "fruit", "ID1", "ID2")

Thanks in advance!

4

6 Answers

16

Here's another way:

x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#        number   fruit ID1 ID2
# item1 number1 papayas  22  33
# item3 number3 peaches 441  25
# item4 number4  apples 123  13

Edit: modified to preserve row names.
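
For anyone wondering why the one-liner works, here is the same idea split into named steps. This is only a sketch on toy data: the data frames `a` and `b` and the intermediate names are made up for illustration and are not the question's tables.

```r
# Toy data frames (hypothetical), standing in for test and test2
a <- data.frame(number = c("n1", "n2"),
                fruit  = c("apples", "oranges"),
                stringsAsFactors = FALSE)
b <- data.frame(number = c("n1", "n2", "n3"),
                fruit  = c("papayas", "oranges", "peaches"),
                stringsAsFactors = FALSE)

x <- rbind(b, a)   # rows of b first, then rows of a

# TRUE for a row that occurs again further down in x
has_later_copy <- duplicated(x, fromLast = TRUE)

# TRUE only for the rows that came from b
from_b <- seq_len(nrow(x)) <= nrow(b)

# Rows of b with no later copy, i.e. rows of b that are not in a
only_in_b <- x[!has_later_copy & from_b, ]
print(only_in_b)
```

Because `a` is stacked underneath `b`, any row of `b` that also exists in `a` has a later copy and is dropped. A row that is repeated inside `b` itself keeps only its last occurrence.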

answered 2013-07-02T14:24:05.147
4

There are two ways to solve this, using data.table and sqldf:

library(data.table)
test<- fread('
item number fruit ID1 ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
')
test2<- fread('
item number fruit ID1 ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"
')

The data.table approach, which lets you choose which columns to compare:

setkey(test,item,number,fruit,ID1,ID2)
setkey(test2,item,number,fruit,ID1,ID2)
test[!test2]
    item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13
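
Note that `test[!test2]` lists the rows of test that are missing from test2, and that the item column is part of the key. For the direction asked in the question (rows of test2 that are not in test), something like the following should work. It is a sketch that assumes data.table is installed; the tables are built with `data.table()` instead of `fread()`, and the names `test_dt`, `test2_dt` and `cols` are illustrative only.

```r
library(data.table)

# Tables mirroring the question's columns, without the item labels
test_dt <- data.table(
  number = c("number1", "number2", "number3", "number4"),
  fruit  = c("apples", "oranges", "peaches", "apples"),
  ID1    = c("22", "13", "44", "12"),
  ID2    = c("33", "33", "25", "13"))
test2_dt <- data.table(
  number = c("number1", "number2", "number3", "number4", "number3", "number4"),
  fruit  = c("papayas", "oranges", "peaches", "apples", "peaches", "apples"),
  ID1    = c("22", "13", "441", "123", "44", "12"),
  ID2    = c("33", "33", "25", "13", "25", "13"))

# Columns to compare; shorten this vector to compare on fewer columns
cols <- c("number", "fruit", "ID1", "ID2")

# Not-join: keep the rows of test2_dt that have no match in test_dt
only_in_test2 <- test2_dt[!test_dt, on = cols]
print(only_in_test2)
```

Using `on =` avoids calling `setkey()`, so neither table gets reordered.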

The SQL approach:

library(sqldf)
sqldf('select * from test except select * from test2')
    item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13
answered 2016-05-17T09:58:33.340
3

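The following should help you:

# Compare test2 and test column by column and collect the test2 row indices
# that hold values absent from the matching column of test
rows <- unique(unlist(mapply(function(x, y) 
          sapply(setdiff(x, y), function(d) which(x==d)), test2, test)))
test2[rows, ]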

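What's happening here:

  • mapply is used to do a column-by-column comparison between the two data sets.
  • It uses setdiff to find any items that are in the former but not in the latter.
  • which identifies the rows of the former that hold those items.
  • unique(unlist(....)) grabs all of the unique rows.
  • We then use that as a filter on the former, i.e. test2.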

Result:

       number   fruit ID1 ID2
item1 number1 papayas  22  33
item3 number3 peaches 441  25
item4 number4  apples 123  13

Edit:

Make sure your test & test2 are data.frames and not matrices, because mapply iterates over each element of a matrix, but over each column of a data.frame:

test  <- as.data.frame(test,  stringsAsFactors=FALSE)
test2 <- as.data.frame(test2, stringsAsFactors=FALSE)
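
To see the difference for yourself, here is a small sketch on a made-up 2x2 matrix (`m`, `df` and the result names are illustrative, not from the question):

```r
# Hypothetical 2x2 character matrix and its data frame equivalent
m  <- matrix(c("a", "b", "c", "d"), nrow = 2,
             dimnames = list(NULL, c("col1", "col2")))
df <- as.data.frame(m, stringsAsFactors = FALSE)

# On a matrix the function is called once per cell
calls_matrix <- length(mapply(function(x) length(x), m))

# On a data frame it is called once per column, and x is the whole column
lens_df <- mapply(function(x) length(x), df)

print(calls_matrix)
print(lens_df)
```

With the matrix, each call only sees a single value, so the setdiff/which comparison above has nothing column-shaped to work on.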
answered 2013-07-02T14:18:56.167
2

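With the dplyr package, you can also use anti_join.

library(dplyr)
missing.species <- anti_join(test2, test, by = NULL)

It returns the rows of test2 that have no match in test. The by argument names the variables to join on explicitly. If it is NULL, the function uses all variables that test and test2 have in common.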
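
One thing to watch: anti_join works on data frames, while the t() calls in the question leave test and test2 as character matrices, so they need converting first. A sketch, assuming dplyr is installed and using a shortened copy of the data; `test_m`, `test2_m`, `test_df` and `test2_df` are illustrative names.

```r
library(dplyr)

# Character matrices shaped like the question's data (shortened)
test_m  <- rbind(c("number1", "apples",  "22", "33"),
                 c("number2", "oranges", "13", "33"))
test2_m <- rbind(c("number1", "papayas", "22",  "33"),
                 c("number2", "oranges", "13",  "33"),
                 c("number3", "peaches", "441", "25"))
colnames(test_m) <- colnames(test2_m) <- c("number", "fruit", "ID1", "ID2")

# Convert to data frames before joining
test_df  <- as.data.frame(test_m,  stringsAsFactors = FALSE)
test2_df <- as.data.frame(test2_m, stringsAsFactors = FALSE)

# Naming the columns in `by` makes the comparison explicit
only_in_test2 <- anti_join(test2_df, test_df,
                           by = c("number", "fruit", "ID1", "ID2"))
print(only_in_test2)
```

Spelling out `by` also makes it easy to compare on a subset of the columns.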

answered 2020-06-25T09:52:43.647
1

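Create a new row-ID column in test2, merge the data frames, then select the rows whose IDs are not in the merge result.

# Tag each row of test2 with its position
test2 <- cbind(test2, id=seq_len(nrow(test2)))

# IDs of the test2 rows that also occur in test (merge joins on all shared columns)
matches <- merge(test, test2)$id

# Drop the matched rows
test2 <- test2[-matches, ]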
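
One caveat with negative indexing: if nothing matches, `-matches` is an empty vector and the subset returns zero rows instead of all of them. A logical filter avoids that. The sketch below uses small made-up data frames (`test_df`, `test2_df`), not the question's matrices.

```r
# Hypothetical data frames standing in for test and test2
test_df  <- data.frame(number = c("number1", "number2"),
                       fruit  = c("apples", "oranges"),
                       stringsAsFactors = FALSE)
test2_df <- data.frame(number = c("number1", "number2", "number3"),
                       fruit  = c("papayas", "oranges", "peaches"),
                       stringsAsFactors = FALSE)

# Tag each row of test2_df, then find the tags that survive the merge
test2_df$id <- seq_len(nrow(test2_df))
matched_ids <- merge(test_df, test2_df)$id

# Logical filter: still correct when matched_ids is empty
only_in_test2 <- test2_df[!test2_df$id %in% matched_ids, names(test_df)]
print(only_in_test2)

# Negative indexing with an empty vector selects no rows at all
empty_case <- nrow(test2_df[-integer(0), ])
print(empty_case)
```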
answered 2013-07-02T14:16:51.313
1

Here is another approach, but I'm not sure how well it scales.

test2[!apply(test2, 1, paste, collapse = "") %in% 
        apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"

This does not drop duplicates. Compare, for example, what happens if test2 contains duplicates:

test2 <- rbind(test2, test2[1:3, ])

## Matthew's answer: Duplicates dropped
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#       number    fruit     ID1   ID2 
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"

## This one: Duplicates retained
test2[!apply(test2, 1, paste, collapse = "") %in%
  apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
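
A caveat on the paste approach: with collapse = "" two different rows can produce the same string. The sketch below shows the collision on made-up one-row matrices (`a` and `b`) and one possible workaround, a separator that is assumed never to occur in the data.

```r
# Two different rows that collapse to the same string
a <- matrix(c("ab", "c"), nrow = 1)   # row: "ab", "c"
b <- matrix(c("a", "bc"), nrow = 1)   # row: "a", "bc"

key_plain_a <- apply(a, 1, paste, collapse = "")
key_plain_b <- apply(b, 1, paste, collapse = "")
# Both keys are "abc", so b's row would wrongly count as present in a

# ASCII unit separator; assumption: it never appears in the data
sep <- "\037"
key_sep_a <- apply(a, 1, paste, collapse = sep)
key_sep_b <- apply(b, 1, paste, collapse = sep)

# With a separator the keys differ and b's row is kept
only_in_b <- b[!key_sep_b %in% key_sep_a, , drop = FALSE]
print(only_in_b)
```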
answered 2013-07-02T15:43:46.547