31

I want to use R's gsub to remove all punctuation from a text except apostrophes. I'm fairly new to regular expressions, but I'm learning.

Example:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[[:punct:]]", "", as.character(x))

Current output (apostrophe removed)

[1] "I like to chew gum but dont like bubble gum"

Desired output (I want the apostrophe to stay)

[1] "I like to chew gum but don't like bubble gum"

4 Answers

42
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^[:alnum:][:space:]']", "", x)

[1] "I like to chew gum but don't like bubble gum"

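The regex above is more straightforward. It replaces everything that is not an alphanumeric character, whitespace, or an apostrophe (note the caret, which negates the bracket expression) with an empty string.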
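
As a rough sketch of how this pattern could be reused on a character vector, the snippet below wraps it in a helper. The name keep_apostrophes and the extra whitespace-collapsing step are assumptions added for illustration; neither is part of the answer above.

```r
# Hypothetical helper, not part of the original answer.
keep_apostrophes <- function(text) {
  # Delete anything that is not a letter, digit, whitespace, or apostrophe.
  cleaned <- gsub("[^[:alnum:][:space:]']", "", text)
  # Removing a free-standing symbol can leave two spaces in a row.
  # Collapse each whitespace run (tabs and newlines included) to one space.
  gsub("[[:space:]]+", " ", cleaned)
}

res <- keep_apostrophes(c("don't stop!", "rock & roll, baby", "it's 5 o'clock?"))
print(res)
# res holds: "don't stop", "rock roll baby", "it's 5 o'clock"
```

Because gsub is vectorized, the helper works on a whole column of strings at once. Drop the second gsub call if the original spacing has to be preserved.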

answered 2012-01-02T07:18:55.610
8

You can exclude the apostrophe from the POSIX class punct by using a double negation:

[^'[:^punct:]]

Code:

x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
gsub("[^'[:^punct:]]", "", x, perl=TRUE)

#[1] "I like to chew gum but don't like bubble gum"
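
For comparison, the same "punctuation, but not an apostrophe" condition can probably also be written with a negative lookahead, which some readers find easier to parse than the double negation. This is a sketch of an alternative, not the answer's own method. The variable names are made up for illustration, and both patterns need perl = TRUE.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Double negation: match any character that is neither an apostrophe
# nor a member of "not punctuation", i.e. punctuation other than '.
double_neg <- gsub("[^'[:^punct:]]", "", x, perl = TRUE)

# Negative lookahead: match a punctuation character only if the
# character at that position is not an apostrophe.
lookahead <- gsub("(?!')[[:punct:]]", "", x, perl = TRUE)

# Both patterns describe the same set of characters.
print(identical(double_neg, lookahead))
```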


answered 2015-10-11T05:07:11.763
7

Here is an example:

>  gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x)
[1] "I like to chew gum but don't like bubble gum"
answered 2012-01-02T03:32:36.053
5

Mostly for variety, here is a solution using gsubfn() from the excellent package of the same name. In this application, I like how expressive the solution is:

library(gsubfn)
gsubfn(pattern = "[[:punct:]]", engine = "R",
       replacement = function(x) ifelse(x == "'", "'", ""), 
       x)
[1] "I like to chew gum but don't like bubble gum"

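(The argument engine = "R" is needed here; otherwise the default tcl engine is used. Its rules for matching regular expressions are slightly different: for example, if it were used to process the string above, you would instead need to set pattern = "[[:punct:]$|^]". Thanks to G. Grothendieck for pointing out that detail.)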
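
If installing gsubfn is not an option, the same per-match callback idea can be approximated in base R with gregexpr() and the regmatches<- replacement function, which also sidesteps the engine question. This is a minimal sketch, not the answer's method, and the name keep_apostrophe is hypothetical.

```r
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

# Hypothetical callback: given every punctuation match found in one string,
# blank out all of them except the apostrophes.
keep_apostrophe <- function(matches) {
  matches[matches != "'"] <- ""
  matches
}

# Locate every punctuation character, then write the callback's result
# back into those same positions.
m <- gregexpr("[[:punct:]]", x)
y <- x
regmatches(y, m) <- lapply(regmatches(y, m), keep_apostrophe)
print(y)
```

The callback receives all matches for a string at once rather than one match at a time, so it has to be vectorized, which the subset assignment above takes care of.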

answered 2012-01-02T05:45:15.600