3

I want to remove multiple patterns from several character vectors. At the moment I am doing:

a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)

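and so on.

This is painful. I looked at this question and answer: R: gsub, pattern = vector and replacement = vector, but it did not solve the problem.

Neither mapply nor mgsub worked. I created these vectors: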

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"

I want:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"

4 Answers

7

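I know this answer is late to the scene, but it stems from my dislike of manually listing the removal patterns inside the grep function (see the other solutions here). My idea is to set the patterns up beforehand, keep them as a character vector, and then paste them together (i.e. only when "needed") using the regex separator "|":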

library(stringr)

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))

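Yes, this effectively does the same thing as some of the other answers here, but I think my solution lets you keep the original "character removal vector", remove.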
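
For anyone without stringr installed, the same idea can be sketched in base R. This is only an illustration under a few assumptions: the two-element a.vector is sample data copied from the question, gsub stands in for str_remove_all, and trimws is an extra step added here to drop the spaces left at the edges.

```r
# Patterns kept as a reusable character vector, as in the answer above
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

# Collapse them into a single alternation only at the point of use
pattern <- paste(remove, collapse = "|")

# Sample data copied from the question (assumption: real data looks similar)
a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# Base-R stand-in for str_remove_all(); trimws() removes leftover edge spaces
cleaned <- trimws(gsub(pattern, "", a.vector))
print(cleaned)
```

Note that [[:punct:]] also matches #, so the hashtags lose their leading # with this pattern set.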

answered 2019-06-03T04:48:50.173
5

Try combining your subpatterns using |. For example:

> s <- "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
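
A toy illustration of that second caveat, using made-up patterns that are not from the question: removing "b" from "abc" leaves "ac", which a combined pattern never revisits but a second pass would match.

```r
# Toy patterns (hypothetical): removing "b" from "abc" exposes a new "ac"
remove <- c("b", "ac")
s <- "abc"

# One pass with the combined pattern: only the original "b" is matched
combined <- gsub(paste(remove, collapse = "|"), "", s)

# One pass per pattern: the second pass sees the "ac" left by the first
sequential <- s
for (p in remove) sequential <- gsub(p, "", sequential)

print(combined)
print(sequential)
```

Here combined is "ac" while sequential is the empty string, so the two approaches are not always interchangeable.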

Consider creating your remove vector as you suggested, then applying it in a loop:

> s1 <- s
> remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants
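
A minimal sketch of such a function, assuming a helper name (remove_patterns) and sample data that are purely illustrative. Because gsub is already vectorised over its input, the helper can take a whole character vector directly; for a data frame it could be passed to lapply over the relevant columns.

```r
# Hypothetical helper: apply each pattern in turn to a character vector
remove_patterns <- function(x, patterns) {
  for (p in patterns) {
    x <- gsub(p, "", x)  # gsub() handles every element of x at once
  }
  x
}

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

# Sample data copied from the question
a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# trimws() is an extra step to drop the spaces left at the edges
cleaned <- trimws(remove_patterns(a.vector, remove))
print(cleaned)
```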

answered 2015-03-13T16:35:10.297
1

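If the multiple patterns you are looking for are fixed and do not change from case to case, you can consider creating a concatenated regex that combines all of the patterns into one super regex pattern.

For the example you provided, you can try: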

removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])"

a.vector <- gsub(removePat, "", a.vector)
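
The question's desired output keeps the hashtags, whereas [[:punct:]] also strips #. One possible variation, which is not part of the answer above, is to guard the punctuation class with a negative lookahead. This is a sketch under assumptions: lookaheads need perl = TRUE, the sample data is copied from the question, and trimws is added to drop the spaces left at the edges.

```r
# Variation (assumption): "(?!#)" makes the punctuation class skip "#"
removePat <- "(@\\w+)|(http\\w+)|((?!#)[[:punct:]])"

# Sample data copied from the question
a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# perl = TRUE switches gsub() to PCRE, which supports lookaheads
cleaned <- trimws(gsub(removePat, "", a.vector, perl = TRUE))
print(cleaned)
```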

answered 2015-10-24T00:15:10.537
-1

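I had a vector containing the statement "my final score", and I wanted to keep the word final and remove the rest. Based on Marian's suggestion, this is what worked for me:

str_remove_all("my final score", "my |score")

Note: "my final score" is just an example. I was dealing with a vector.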

answered 2019-10-05T12:50:08.083