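I'm trying to match two groups of words against a number of strings. The two groups are car and school, and I use the stringr package to build patterns that match any instance of a word from car or from school.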

library(stringr)
car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
car_match <- str_c(car, collapse = "|")
school_match <- str_c(school, collapse = "|")
df <- data.frame(keyword=c("He drives a Honda", 
                           "He goes to Ohio State", 
                           "He likes Ford and goes to Ohio State"))
df

main <- function(df) {
  df$car <- as.numeric(str_detect(df$keyword, car_match))
  df$school <- as.numeric(str_detect(df$keyword, school_match))
  df
}
main(df)

> main(df)
                               keyword car school
1                    He drives a Honda   1      0
2                He goes to Ohio State   0      1
3 He likes Ford and goes to Ohio State   1      1

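Great, that works.

Now I'd like to go back and see whether I can easily compute the frequency of each word within the car and school "buckets".

So the result should look like this: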

Car         Freq
Honda       1
Chevy       0
Toyota      0
Ford        1

school      Freq
Michigan    0
Ohio State  2
Missouri    0

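Because Honda appears once among the car matches, its frequency count is 1. Likewise, Ohio State appears twice among the school matches, so its frequency is 2.

Can anyone help me get from matching a category to finding the frequency of each word within that category?

I could go back and give each word in car its own str_c pattern and match that way, but I'd like to find a "simpler" route.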

2 Answers

You can use the qdap package for this task, as follows:

library(qdap)
key <- list(
    car = c("Honda", "Chevy", "Toyota", "Ford"),
    school = c("Michigan", "Ohio State", "Missouri")
)

(out <- with(df, termco(keyword, keyword, key, elim.old = FALSE)))
counts(out)

##                                keyword word.count Honda Chevy Toyota Ford Michigan Ohio State Missouri car school
## 1                    He drives a Honda          4     1     0      0    0        0          0        0   1      0
## 2                He goes to Ohio State          5     0     0      0    0        0          1        0   0      1
## 3 He likes Ford and goes to Ohio State          8     0     0      0    1        0          1        0   1      1

colSums(counts(out)[, -1])

## word.count      Honda      Chevy     Toyota       Ford   Michigan Ohio State   Missouri        car     school 
##         17          1          0          0          1          0          2          0          2          2 
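
If adding qdap is not desirable, a similar per-row count table can be approximated with stringr alone. This is a minimal sketch, not the qdap implementation: it assumes each term should be matched literally (hence `fixed()`), and the names `terms`, `per_row`, and `totals` are made up for illustration.

```r
library(stringr)

key <- list(
  car = c("Honda", "Chevy", "Toyota", "Ford"),
  school = c("Michigan", "Ohio State", "Missouri")
)
keyword <- c("He drives a Honda",
             "He goes to Ohio State",
             "He likes Ford and goes to Ohio State")

# Flatten both buckets into one character vector of terms.
terms <- unlist(key, use.names = FALSE)

# One column per term, one row per string; each cell is the number of
# literal occurrences of that term in that string.
per_row <- sapply(terms, function(term) str_count(keyword, fixed(term)))

# Column sums give the overall frequency of each term.
totals <- colSums(per_row)
```

`totals[key$car]` and `totals[key$school]` then pick out the frequencies for each bucket separately.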
Answered 2014-04-30T00:14:49.563

Maybe something like this:

sapply(car, function(x) sum(str_count(df$keyword, x)))
# Honda  Chevy Toyota   Ford 
#     1      0      0      1 

sapply(school, function(x) sum(str_count(df$keyword, x)))
# Michigan Ohio State   Missouri 
#        0          2          0
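
To get the two-column layout shown in the question, one option is to reuse the alternation patterns and tabulate the extracted matches. The following is a sketch under the assumption that the words contain no regex metacharacters; `bucket_freq` is a hypothetical helper name, not part of stringr.

```r
library(stringr)

car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
keyword <- c("He drives a Honda",
             "He goes to Ohio State",
             "He likes Ford and goes to Ohio State")

# Hypothetical helper: extract every match of any word in the bucket,
# then tabulate against the full word list so that unmatched words
# still appear with a count of 0.
bucket_freq <- function(text, words, label) {
  pattern <- str_c(words, collapse = "|")
  found <- unlist(str_extract_all(text, pattern))
  out <- as.data.frame(table(factor(found, levels = words)),
                       stringsAsFactors = FALSE)
  names(out) <- c(label, "Freq")
  out
}

car_freq <- bucket_freq(keyword, car, "Car")
school_freq <- bucket_freq(keyword, school, "school")
```

Passing the word list as factor levels is what keeps zero-count entries such as Chevy and Michigan in the result.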
Answered 2014-04-29T20:51:56.730