r - Subsetting data based on whether any of multiple variables are/aren't included in a list

Question

I am trying to make two subsets of my data: one where any of the columns 5 through 10 contains a factor level from my list (keep.list), and one where none of those columns contains anything from keep.list. Here's where I am so far, but I can't get it to subset correctly:

test.cols <- c(5:10)
keep.list <- c("dog","cat","mouse","bird")

data.sub.IN <- data.big[which(any(keep.list %in% data.big[test.cols])),]

data.sub.NOT.IN <- data.big[which(any(keep.list !%in% data.big[test.cols])),]

I think which() and any() can help, but I might be wrong, and I don't know how to handle the "not included" case, as the usual ! operator isn't working.

score 3 · Accepted Answer

You can do this using apply:

keep <- apply(data.big[test.cols], 1, function(r) any(r %in% keep.list))
data.sub.IN <- data.big[keep, ]
data.sub.NOT.IN <- data.big[!keep, ]

apply applies a function to each row of the data frame. In this case, for each row, it checks whether any of the items in that row are in keep.list.
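
As a minimal sketch of the approach above on made-up data: the contents of data.big and its column names (id, pet1, pet2) are hypothetical, and test.cols is narrowed to 2:3 only to fit the toy example.

```r
# Hypothetical toy data; the real data.big from the question is not shown.
data.big <- data.frame(
  id   = 1:4,
  pet1 = c("dog", "fish", "snake", "cat"),
  pet2 = c("horse", "bird", "lizard", "cow"),
  stringsAsFactors = TRUE  # factor columns, as described in the question
)
test.cols <- 2:3  # assumption: the columns to test in this toy example
keep.list <- c("dog", "cat", "mouse", "bird")

# One logical per row: TRUE if any tested column holds a value in keep.list.
# apply() converts the selected columns to a character matrix first.
keep <- apply(data.big[test.cols], 1, function(r) any(r %in% keep.list))

data.sub.IN     <- data.big[keep, ]   # rows with at least one match
data.sub.NOT.IN <- data.big[!keep, ]  # rows with no match at all
```

Because keep has one element per row, negating it with ! gives the complementary subset directly, with no need for which().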

score 1 · Accepted Answer

I'd go with @DavidRobinson's answer, but if you want to keep your code in its current form, you need to move the !. To negate %in%, you put the ! in front of the whole expression (before the left-hand operand), not between the operand and %in%.

B <- 1:4
A <- 3:6
A %in% B
[1]  TRUE  TRUE FALSE FALSE
!A %in% B
[1] FALSE FALSE  TRUE  TRUE
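
The placement works because of operator precedence: %in% binds more tightly than unary !, so !A %in% B is parsed as !(A %in% B). A quick check, reusing the same A and B:

```r
A <- 3:6
B <- 1:4

# %in% is evaluated first, then the result is negated,
# so the form with and without parentheses agree.
same <- identical(!A %in% B, !(A %in% B))
```

Adding the parentheses anyway is harmless and can make the intent clearer to readers.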

So for your case:

data.sub.NOT.IN <- data.big[which(any(!keep.list %in% data.big[test.cols])),]

But, again, in this case using apply is a better option, I think.

EDIT: Based on @DWin's comment, this may not work (hard to tell without an example dataset); you might actually need:

data.sub.NOT.IN <- data.big[which(!any(keep.list %in% data.big[test.cols])),]
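
One caution on the which(any(...)) forms, offered as a sketch rather than something tested on the asker's data: any() collapses its input to a single TRUE or FALSE, so which() returns either 1 or an empty index, and the subset is then the first row or no rows rather than a per-row selection. If an alternative to apply is wanted, one option is to test each column against keep.list and combine the results element-wise with Reduce. The toy data frame and its column names below are hypothetical.

```r
# Hypothetical toy data standing in for data.big
data.big <- data.frame(
  id   = 1:4,
  pet1 = c("dog", "fish", "snake", "cat"),
  pet2 = c("horse", "bird", "lizard", "cow")
)
test.cols <- 2:3  # assumption for this toy example
keep.list <- c("dog", "cat", "mouse", "bird")

# any() yields a single value, so which() can only give 1 or integer(0)
collapsed <- which(any(TRUE, FALSE, TRUE))

# Per-column membership tests, OR-ed together: one logical per row
keep <- Reduce(`|`, lapply(data.big[test.cols], function(col) col %in% keep.list))

data.sub.IN     <- data.big[keep, ]
data.sub.NOT.IN <- data.big[!keep, ]
```

Here keep plays the same role as in the apply version, so the "not in" subset is again just data.big[!keep, ].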
