r - Filter multiple values on a string column in dplyr

Question

I have a data.frame with character data in one of the columns. I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing?

Example: data.frame name = dat

days  name
88    Lynn
11    Tom
2     Chris
5     Lisa
22    Kyla
1     Tom
222   Lynn
2     Lynn

I'd like to filter out Tom and Lynn for example.
When I do:

target <- c("Tom", "Lynn")
filt <- filter(dat, name == target)

I get this error:

longer object length is not a multiple of shorter object length

score 242 · Accepted Answer

You need %in% instead of ==:

library(dplyr)
target <- c("Tom", "Lynn")
filter(dat, name %in% target)  # equivalently, dat %>% filter(name %in% target)

This produces:

  days name
1   88 Lynn
2   11  Tom
3    1  Tom
4  222 Lynn
5    2 Lynn

To understand why, consider what happens here:

dat$name == target
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Basically, we are recycling the length-two target vector four times to match the length of dat$name. In other words, we are doing:

 Lynn == Tom
  Tom == Lynn
Chris == Tom
 Lisa == Lynn
 ... continue repeating Tom and Lynn until end of data frame

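In this case we don't get an error: the sample you provided has 8 rows, an even number, so the two-element target recycles cleanly. I suspect your actual data frame has a number of rows that does not allow clean recycling. If the sample had an odd number of rows, I would have gotten the same error as you. But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target is equivalent to saying:

Return TRUE for every odd-positioned value that equals "Tom" or every even-positioned value that equals "Lynn".

It so happens that the last value in your sample data frame is in an even position and equals "Lynn", hence the single TRUE above.

By contrast, dat$name %in% target says:

For each value in dat$name, check whether it exists in target.

Very different. Here is the result: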

[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Note that your problem has nothing to do with dplyr; it is just a misuse of ==.
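
The recycling behaviour described above can be reproduced in base R alone. This is a minimal sketch, assuming the question's sample data is rebuilt by hand as dat (the question shows only the printed table, so the construction and the helper names recycled, eq_result, in_result and odd_warning are assumptions):

```r
# Rebuild the question's sample data (assumed construction).
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)
target <- c("Tom", "Lynn")

# == recycles target to length 8 and compares position by position.
recycled <- rep(target, length.out = nrow(dat))
eq_result <- dat$name == target
identical(eq_result, dat$name == recycled)

# %in% asks, for each name, whether it appears anywhere in target.
in_result <- dat$name %in% target

sum(eq_result)  # 1 row matched by ==
sum(in_result)  # 5 rows matched by %in%

# With 7 rows the lengths no longer divide evenly, so base R
# signals the "longer object length" warning; capture its message.
odd_warning <- tryCatch(
  {
    dat$name[1:7] == target
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
```

In base R the length mismatch surfaces as a warning rather than a hard error, which is why the sketch catches it with a warning handler.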

score 13 · Accepted Answer

This can be achieved with the dplyr package, which is available on CRAN. A simple way to do it:

Install the dplyr package.
Run the following code:

library(dplyr) 

df <- select(filter(dat, name == 'Tom' | name == 'Lynn'), c('days', 'name'))

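Explanation:

Once dplyr is installed, we create a new data frame using two different functions from the package:

filter: the first argument is the data frame; the second argument is the condition by which we want to subset it. The result is the whole data frame with only the rows we want.
select: the first argument is the data frame; the second argument is the names of the columns we want to select from it. We don't have to use the names() function, and we don't even have to use quotes. We can simply list the column names as objects.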
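
The same filter-then-select step can also be written as a pipeline with unquoted column names. This is a sketch, assuming dat is the question's sample data rebuilt by hand; the name result is illustrative:

```r
library(dplyr)

# Rebuild the question's sample data (assumed construction).
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)

result <- dat %>%
  filter(name == "Tom" | name == "Lynn") %>%  # keep rows matching either name
  select(days, name)                          # columns listed without quotes
```

Chaining == with | works for two names, but it grows quickly; for longer lists, name %in% target keeps the condition short.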

score 12 · Accepted Answer

Using the base package:

df <- data.frame(days = c(88, 11, 2, 5, 22, 1, 222, 2), name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"))

# Three lines
target <- c("Tom", "Lynn")
index <- df$name %in% target
df[index, ]

# One line
df[df$name %in% c("Tom", "Lynn"), ]

Output:

  days name
1   88 Lynn
2   11  Tom
6    1  Tom
7  222 Lynn
8    2 Lynn

Using sqldf:

library(sqldf)
# Two alternatives:
sqldf("SELECT *
      FROM df
      WHERE name = 'Tom' OR name = 'Lynn'")
sqldf("SELECT *
      FROM df
      WHERE name IN ('Tom', 'Lynn')")

score 1 · Accepted Answer

library(dplyr)  # provides %>% and filter()
by_type_year_tag_filtered <- by_type_year_tag %>%
  dplyr::filter(tag_name %in% c("dplyr", "ggplot2"))

score 0 · Accepted Answer

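If the values in your string column are long strings, you can use this powerful method from the stringr package. It matches words inside each string, which filter( %in% ) cannot do because %in% only compares whole values.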

library(dplyr)
library(stringr)

sentences_tb = as_tibble(sentences) %>%
                 mutate(row_number())
sentences_tb
# A tibble: 720 x 2
   value                                       `row_number()`
   <chr>                                                <int>
 1 The birch canoe slid on the smooth planks.               1
 2 Glue the sheet to the dark blue background.              2
 3 It's easy to tell the depth of a well.                   3
 4 These days a chicken leg is a rare dish.                 4
 5 Rice is often served in round bowls.                     5
 6 The juice of lemons makes fine punch.                    6
 7 The box was thrown beside the parked truck.              7
 8 The hogs were fed chopped corn and garbage.              8
 9 Four hours of steady work faced us.                      9
10 Large size in stockings is hard to sell.                10
# ... with 710 more rows                

matching_letters <- c(
  "canoe","dark","often","juice","hogs","hours","size"
)
matching_letters <- str_c(matching_letters, collapse = "|")
matching_letters
[1] "canoe|dark|often|juice|hogs|hours|size"

letters_found <- str_subset(sentences_tb$value,matching_letters)
letters_found_tb = as_tibble(letters_found)
inner_join(sentences_tb,letters_found_tb)

# A tibble: 16 x 2
   value                                          `row_number()`
   <chr>                                                   <int>
 1 The birch canoe slid on the smooth planks.                  1
 2 Glue the sheet to the dark blue background.                 2
 3 Rice is often served in round bowls.                        5
 4 The juice of lemons makes fine punch.                       6
 5 The hogs were fed chopped corn and garbage.                 8
 6 Four hours of steady work faced us.                         9
 7 Large size in stockings is hard to sell.                   10
 8 Note closely the size of the gas tank.                     33
 9 The bark of the pine tree was shiny and dark.             111
10 Both brothers wear the same size.                         253
11 The dark pot hung in the front closet.                    261
12 Grape juice and water mix well.                           383
13 The wall phone rang loud and often.                       454
14 The bright lanterns were gay on the dark lawn.            476
15 The pleasant hours fly by much too soon.                  516
16 A six comes up more often than a ten.                     609

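It is a bit verbose, but it is very handy and powerful if you have long strings and want to keep the rows that contain a specific word.

Compare with the other answers: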

> target <- c("canoe","dark","often","juice","hogs","hours","size")
> filter(sentences_tb, value %in% target)
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>

> df<- select(filter(sentences_tb,value=='canoe'| value=='dark'), c('value','row_number()'))
> df
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>

> target <- c("canoe","dark","often","juice","hogs","hours","size")
> index <- sentences_tb$value %in% target
> sentences_tb[index, ]
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>

With those approaches, you would need to write out every full sentence to get the desired result.
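
For comparison, the pattern match can also go directly inside the row subset, without the intermediate join. Below is a base R sketch using grepl() on a small hand-made data frame; the names texts, words, pattern, partial and exact are illustrative assumptions, not part of the original answer. With dplyr and stringr loaded, filter(sentences_tb, str_detect(value, matching_letters)) should behave similarly.

```r
# Illustrative data: three sentences taken from the output above.
texts <- data.frame(
  value = c(
    "The birch canoe slid on the smooth planks.",
    "These days a chicken leg is a rare dish.",
    "Rice is often served in round bowls."
  ),
  stringsAsFactors = FALSE
)
words <- c("canoe", "dark", "often")

# Collapse the words into one regular expression: "canoe|dark|often".
pattern <- paste(words, collapse = "|")

# grepl() is TRUE for each row whose text contains any of the words.
partial <- texts[grepl(pattern, texts$value), , drop = FALSE]

# %in% compares whole values, so no sentence equals a single word.
exact <- texts[texts$value %in% words, , drop = FALSE]

nrow(partial)  # 2
nrow(exact)    # 0
```

Note that this is substring matching, so a word such as "dark" would also match "darkness"; wrapping each word in \\b word boundaries is one way to tighten it.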
