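I have a string and a character vector. I want to find all strings in the character vector that match as many characters as possible from the beginning of my string. For example: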
s <- "abs"
vc <- c("ab","bb","abc","acbd","dert")
result <- c("ab","abc")
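The first K characters of the string s should match exactly, and I want the match to be as long as possible (the largest K, with K <= nchar(s)). Here there is no match for "abs" (grep("abs",vc)), but there are two matches for "ab" (result <-grep("ab",vc)).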
Another interpretation:
s <- "abs"
# Updated vc
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
st <- strsplit(s, "")[[1]]
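mtc <- sapply(strsplit(substr(vc, 1, nchar(s)), ""),
              function(i) {
                # TRUE where the characters of s and the candidate agree
                m <- i == st[seq_along(i)]
                # cumprod() drops to 0 at the first mismatch, so only leading matches count
                sum(cumprod(m))
              })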
vc[mtc == max(mtc)]
#[1] "ab" "abc" "abwabsabs"
# Another vector vc
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
....
vc[mtc == max(mtc)]
#[1] "absq"
Since we only consider the beginning of the strings, in the first case the longest match is "ab", even though there is "abwabsabs", which contains "abs".
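To make the "only the beginning counts" rule concrete, here is a minimal sketch that computes, for each candidate, how many leading characters it shares with s. The helper name common_prefix_len is hypothetical and is not part of the answer above; it only illustrates the idea under the assumption that a match stops at the first differing character.

```r
# Hypothetical helper: number of leading characters that x shares with s.
common_prefix_len <- function(s, x) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(x, "")[[1]]
  n <- min(length(a), length(b))
  if (n == 0) return(0)
  # cumprod() becomes 0 at the first mismatch and stays 0 afterwards,
  # so later coincidental matches are not counted
  sum(cumprod(a[seq_len(n)] == b[seq_len(n)]))
}

s <- "abs"
vc <- c("ab", "bb", "abc", "acbd", "dert", "abwabsabs")
len <- vapply(vc, common_prefix_len, numeric(1), s = s, USE.NAMES = FALSE)
best <- vc[len == max(len)]
```

With this input, "abwabsabs" is scored on its leading "ab" only, so the "abs" that appears later in that string does not affect the result.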
Edit: here is a "single pattern" solution. It could probably be more concise, but here we go...
vc <- c("ab","bb","abc","acbd","dert","abwabsabs")
(auxOne <- sapply((nchar(s)-1):1, function(i) substr(s, 1, i)))
#[1] "ab" "a"
(auxTwo <- sapply(nchar(s):2, function(i) substring(s, i)))
#[1] "s" "bs"
l <- attr(regexpr(
  paste0("^((", s, ")|", paste0("(", auxOne, "(?!", auxTwo, "))", collapse = "|"), ")"),
  vc, perl = TRUE), "match.length")
vc[l == max(l)]
#[1] "ab" "abc" "abwabsabs"
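It can help to look at the pattern that the paste0 calls above actually build. The sketch below rebuilds it for s <- "abs" and applies it to a small vector; vc2 and len2 are names introduced here only for illustration. It assumes s has at least two characters and contains no regular-expression metacharacters.

```r
s <- "abs"
auxOne <- sapply((nchar(s) - 1):1, function(i) substr(s, 1, i))  # "ab" "a"
auxTwo <- sapply(nchar(s):2, function(i) substring(s, i))        # "s"  "bs"

# One alternative per prefix of s, longest first
pattern <- paste0("^((", s, ")|",
                  paste0("(", auxOne, "(?!", auxTwo, "))", collapse = "|"),
                  ")")
pattern

# match.length is the length of the matched prefix, or -1 when nothing matches
vc2 <- c("ab", "abc", "absq", "dert")  # illustrative input
len2 <- attr(regexpr(pattern, vc2, perl = TRUE), "match.length")
vc2[len2 == max(len2)]
```

The longest matched length is then used to select the elements, exactly as in vc[l == max(l)].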
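Here is a function that uses grep to check whether the given string s matches the beginning of any string in vc, repeatedly removing one character from the end of s until it finds a match: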
myfun <- function(s, vc) {
  notDone <- TRUE
  maxChar <- max(nchar(vc))    # EDIT: these two lines truncate s to
  s <- substr(s, 1, maxChar)   # the maximum number of chars in vc
  subN <- nchar(s)
  ans <- character(0)          # returned as-is when s is empty
  while (notDone && subN > 0) {
    ss <- substr(s, 1, subN)   # current prefix of s
    # ss is used as a regular expression; metacharacters in s are not escaped
    ans <- grep(sprintf("^%s", ss), vc, value = TRUE)
    if (length(ans)) {
      notDone <- FALSE
    } else {
      subN <- subN - 1
    }
  }
  return(ans)
}
s <- "abs"
# Updated vc from @Julius's answer
vc <- c("ab","bb","abc","acbd","dert","absq","abab")
> myfun(s, vc)
[1] "absq"
# And there's no infinite recursion if there's no match
> myfun("q", "a")
character(0)
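The function above passes each prefix to grep as a regular expression, so a string s that contains characters such as "." or "(" may match more than intended. As a possible alternative, here is a minimal sketch of the same shrinking-prefix loop using base R's startsWith(), which compares literally. The name longest_prefix_matches is hypothetical, and startsWith() requires R 3.3.0 or later.

```r
# Hypothetical variant: same idea as myfun, but with literal prefix comparison.
longest_prefix_matches <- function(s, vc) {
  # Try the full string first, then drop one trailing character at a time
  for (n in rev(seq_len(nchar(s)))) {
    hits <- vc[startsWith(vc, substr(s, 1, n))]
    if (length(hits) > 0) return(hits)
  }
  character(0)  # no prefix of s starts any element of vc
}

vc <- c("ab", "bb", "abc", "acbd", "dert", "absq", "abab")
res1 <- longest_prefix_matches("abs", vc)
res2 <- longest_prefix_matches("q", "a")
# "." is treated literally here, so "abc" is not matched by the prefix "a.c"
res3 <- longest_prefix_matches("a.c", c("abc", "a.cd"))
```

Because the loop runs over a fixed sequence of prefix lengths, it also terminates when nothing matches.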
Adding this much later: the triebeard package now exists. It is very efficient and user-friendly for finding longest or partial matches.