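I am trying to work out how to filter some observations from a large dataset using dplyr and grepl. I am not wedded to grepl if other solutions would be more optimal.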

Take this sample df:

df1 <- data.frame(fruit=c("apple", "orange", "xapple", "xorange", 
                          "applexx", "orangexx", "banxana", "appxxle"), group=c("A", "B") )
df1


#     fruit group
#1    apple     A
#2   orange     B
#3   xapple     A
#4  xorange     B
#5  applexx     A
#6 orangexx     B
#7  banxana     A
#8  appxxle     B

I would like to:

  1. filter out the cases that begin with 'x'
  2. filter out the cases that end with 'xx'

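I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not only the cases that begin or end with it. Here is how to get rid of everything with 'xx' anywhere inside (not just at the end):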

df1 %>%  filter(!grepl("xx",fruit))

#    fruit group
#1   apple     A
#2  orange     B
#3  xapple     A
#4 xorange     B
#5 banxana     A

This obviously also filters out 'appxxle', which is "wrong" from my point of view.

I have never fully got to grips with regular expressions. I've been trying to modify code such as: grepl("^(?!x).*$", df1$fruit, perl = TRUE) to try and make it work within the filter command, but am not quite getting it.

Expected output:

#      fruit group
#1     apple     A
#2    orange     B
#3   banxana     A
#4   appxxle     B

I'd like to do this inside dplyr if possible.

1 Answer

I didn't understand your second regex, but this more basic regex seems to do the trick:

df1 %>% filter(!grepl("^x|xx$", fruit))
###
    fruit group
1   apple     A
2  orange     B
3 banxana     A
4 appxxle     B

And I assume you know this, but you don't have to use dplyr here at all:

df1[!grepl("^x|xx$", df1$fruit), ]
###
    fruit group
1   apple     A
2  orange     B
7 banxana     A
8 appxxle     B
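
Since the question says grepl isn't a requirement: for a fixed prefix and suffix you may not need a regex at all. This is only a sketch, assuming R 3.3.0 or later, where base R has startsWith() and endsWith(). Both need character input, so stringsAsFactors = FALSE is set explicitly here (that argument is my addition, not part of the original data frame). The names keep and result are just illustrative.

```r
# Same sample data as in the question, but with fruit kept as character
df1 <- data.frame(fruit = c("apple", "orange", "xapple", "xorange",
                            "applexx", "orangexx", "banxana", "appxxle"),
                  group = c("A", "B"),
                  stringsAsFactors = FALSE)

# Keep rows that neither start with "x" nor end with "xx"
keep <- !startsWith(df1$fruit, "x") & !endsWith(df1$fruit, "xx")
result <- df1[keep, ]
result
```

The same logical condition should also work inside filter() if you prefer to stay in dplyr.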

The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and ending of the string respectively. | is the OR operator. We're negating the results of grepl with the ! so we're finding strings that don't match what's inside the regex.
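
If it helps to see the pieces separately, here is a small sketch that runs each half of the pattern on its own and then the combined version. It uses a plain character vector rather than the data frame, and the variable names (starts_x, ends_xx, combined, kept) are only illustrative.

```r
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

starts_x <- grepl("^x", fruit)      # TRUE where the string begins with "x"
ends_xx  <- grepl("xx$", fruit)     # TRUE where the string ends with "xx"
combined <- grepl("^x|xx$", fruit)  # TRUE where either condition holds

# The alternation gives the same result as OR-ing the two simpler tests
identical(combined, starts_x | ends_xx)
# [1] TRUE

# Negate to keep only the strings that match neither condition
kept <- fruit[!combined]
kept
# [1] "apple"   "orange"  "banxana" "appxxle"
```

Note that "banxana" and "appxxle" survive because their x's are in the middle of the string, so neither anchor matches.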

Answered 2014-09-23T16:01:51.293