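I have a list of items and a list of search terms, and I am trying to do two things:

  1. Search the items for a match against any of the search terms, and return TRUE if a match is found.
  2. For every item that returns TRUE (i.e. a match exists), also return the original search term that was matched in step 1.

So, given the following data frame: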

             items
1             alex
2 alex is a person
3   this is a test
4            false
5    this is cathy

and the following list of search terms:

"alex"      "bob"       "cathy"     "derrick"   "erica"     "ferdinand"

I would like to create the following output:

             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE     alex
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy

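Step 1 is fairly simple, but I am having trouble with step 2. To create the "matches" column, I use grepl() to build a variable that is TRUE if the row in d$items contains one of the search terms, and FALSE otherwise.

For step 2, my thinking was that I should be able to use grep() with value = T, as in the code below. However, this returns the wrong value: instead of the original search term that grep matched, it returns the value of the item that matched. So I get the following output: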

             items matches         original
1             alex    TRUE             alex
2 alex is a person    TRUE alex is a person
3   this is a test   FALSE             <NA>
4            false   FALSE             <NA>
5    this is cathy    TRUE    this is cathy
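
The behaviour can be reproduced on a smaller case. The sketch below uses a made-up two-element `items` vector and a simplified pattern (`alex|bob`), both assumptions for illustration and not the full pattern used in the question. It shows that grep() reports which elements matched, either by index or by value, and never which alternative of the pattern was responsible.

```r
# Hypothetical two-item example with a simplified pattern
items <- c("alex is a person", "this is a test")
pattern <- "alex|bob"

# Default: integer indices of the elements that matched
idx <- grep(pattern, items)

# value = TRUE: the matching elements themselves, not the term that matched
vals <- grep(pattern, items, value = TRUE)

print(idx)
print(vals)
```

Here `vals` holds the whole string "alex is a person", which is why the "original" column above ends up containing the items.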

Here is the code I am using right now. Any ideas would be much appreciated!

# Dummy data and search terms
d = data.frame(items = c("alex", "alex is a person", "this is a test", "false", "this is cathy"))
searchTerms = c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# Return true iff search term is found in items column, not between letters
d$matches = grepl(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", 
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "", 
    collapse = "|"), d[,1], ignore.case = TRUE
)

# Subset data
dMatched = d[d$matches==T,]   

# This is where the problem is: return the value that was originally matched with grepl above
dMatched$original = grep(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", 
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "", 
    collapse = "|"), dMatched[,1], ignore.case = TRUE, value = TRUE
)


d$original[d$matches==T] = dMatched$original

2 Answers


Thanks to Dason for the helpful hint! I was able to solve my problem by using regmatches(). Here is my code, picking up from the original question:

# Locate the first match in each matched item with regexpr(), then extract the matched text
m = regexpr(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", 
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "", 
    collapse = "|"), dMatched[,1], ignore.case = TRUE 
)

dMatched$original = regmatches(dMatched[,1], m)

d$original[d$matches==T] = dMatched$original

This returns the following output, which is exactly what I wanted:

             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE    alex 
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy
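
Note that regmatches() returns the text that was actually matched, which can include the boundary character captured by the pattern (row 2 above reads `alex ` with a trailing space). If the search term itself is needed, one possible alternative is to test each term separately and record the term on a hit. The following is a minimal sketch under some assumptions: it uses `[^[:alpha:]]` as a shorthand for the "not a letter" class, assumes the search terms contain no regex metacharacters, and resolves multiple hits in favour of the term that comes first in `searchTerms`.

```r
# Dummy data and search terms from the question
d <- data.frame(items = c("alex", "alex is a person", "this is a test",
                          "false", "this is cathy"),
                stringsAsFactors = FALSE)
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# Start with no match recorded for any row
d$original <- NA_character_

# Walk the terms in reverse so that, when several terms match the same item,
# the one listed first in searchTerms is written last and therefore kept
for (term in rev(searchTerms)) {
  pattern <- paste0("(^|[^[:alpha:]])", term, "($|[^[:alpha:]])")
  found <- grepl(pattern, d$items, ignore.case = TRUE)
  d$original[found] <- term
}

# A row matched if some term was recorded for it
d$matches <- !is.na(d$original)

print(d)
```

Because the term is written directly, the result is the search term as spelled in `searchTerms`, with no surrounding whitespace and regardless of the case used in the item.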
answered 2013-05-09T19:26:03.263

Not exactly what you asked for, but you can use qdap's termco function to do this. It helps if you have two names in the same sentence:

library(qdap)
termco(d$items, 1:nrow(d), searchTerms)

## > termco(d$items, 1:nrow(d), searchTerms)
##   nrow(d word.count       alex bob     cathy derrick erica ferdinand
## 1      1          1 1(100.00%)   0         0       0     0         0
## 2      2          4  1(25.00%)   0         0       0     0         0
## 3      3          4          0   0         0       0     0         0
## 4      4          1          0   0         0       0     0         0
## 5      5          3          0   0 1(33.33%)       0     0         0

To get what you are after with qdap, you could use:

dat <- termco(d$items, 1:nrow(d), searchTerms)$raw
terms <- character()

for (i in 3:ncol(dat)){
    terms <- paste(terms, ifelse(dat[, i] == 1, colnames(dat)[i], ""))
}

d$matches <- as.logical(rowSums(dat[, -c(1:2)]))
x <- gsub(" ", ", ", clean(trim(terms)))
d$original <- replacer(x, "", NA)

## > d
##              items matches original
## 1             alex    TRUE     alex
## 2 alex is a person    TRUE     alex
## 3   this is a test   FALSE     <NA>
## 4            false   FALSE     <NA>
## 5    this is cathy    TRUE    cathy
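
For comparison, the case of two names in the same sentence can also be handled in base R without qdap. This is only a sketch under stated assumptions: the `items` vector is a made-up example that includes an item with two names, `[^[:alpha:]]` stands in for the question's "not a letter" class, and the search terms are assumed to contain no regex metacharacters.

```r
# Hypothetical data: the second item mentions two search terms
items <- c("alex", "alex met cathy", "this is a test")
searchTerms <- c("alex", "bob", "cathy")

# For each item, keep every search term that occurs in it as a whole word
found <- lapply(items, function(item) {
  hit <- vapply(searchTerms, function(term) {
    pattern <- paste0("(^|[^[:alpha:]])", term, "($|[^[:alpha:]])")
    grepl(pattern, item, ignore.case = TRUE)
  }, logical(1))
  searchTerms[hit]
})

# Collapse the terms found for each item into one comma-separated string;
# items with no hit collapse to "" and are turned into NA
original <- vapply(found, paste, character(1), collapse = ", ")
original[original == ""] <- NA
matches <- !is.na(original)

print(data.frame(items, matches, original))
```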
answered 2013-05-09T19:54:55.323