r - Regex; eliminate all punctuation except

Question

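I have the following regex, which splits on any whitespace or punctuation. How can I exclude one or more punctuation marks from [:punct:]? Say I want to exclude apostrophes and commas. I know I could explicitly write [all punctuation marks in here] instead of [[:punct:]], but I'm hoping there is a way to exclude characters.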

X <- "I'm not that good at regex yet, but am getting better!"
strsplit(X, "[[:space:]]|(?=[[:punct:]])", perl=TRUE)

 [1] "I"       "'"       "m"       "not"     "that"    "good"    "at"      "regex"   "yet"    
[10] ","       ""        "but"     "am"      "getting" "better"  "!"

score 9 · Accepted Answer

It's not clear to me what result you want, but you may be able to use a negated character class, as in this answer.

R> strsplit(X, "[[:space:]]|(?=[^,'[:^punct:]])", perl=TRUE)[[1]]
 [1] "I'm"     "not"     "that"    "good"    "at"      "regex"   "yet,"   
 [8] "but"     "am"      "getting" "better"  "!"
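
The class above relies on a double negative: [:^punct:] is the negated POSIX class (a PCRE extension), so [^,'[:^punct:]] matches a character that is not a comma, not an apostrophe, and not a non-punctuation character. A minimal sketch to see which characters that leaves; the sample string s is made up for illustration and is not from the question:

```r
# Sample string invented for illustration (assumption, not from the question).
s <- "a,b'c.d!e-f"

# Extract every character matched by the negated class:
# punctuation other than the comma and the apostrophe.
hits <- regmatches(s, gregexpr("[^,'[:^punct:]]", s, perl = TRUE))[[1]]
print(hits)
```

The comma and the apostrophe are skipped, while ".", "!" and "-" are matched, which is why strsplit() no longer splits in front of , and '.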

score 0 · Accepted Answer

You can use a (?![',]) negative lookahead to restrict the PCRE subpattern directly; it fails the match if the next character to the right is ' or ,:

[[:space:]]|(?=(?![',])[[:punct:]])
               ^^^^^^^^

See the regex demo.

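Details

[[:space:]] - any whitespace character
| - or
(?=(?![',])[[:punct:]]) - a positive lookahead requiring that the character immediately to the right of the current position is not ' or , and is a punctuation character (in effect, any punctuation character other than ' and ,).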
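
As a rough check of the description above, the lookahead-restricted class can be compared with the negated class [^,'[:^punct:]] from the other answer over the printable ASCII range. This is only a sketch: the variable names are made up, and it assumes R's default PCRE behaviour, where POSIX classes cover ASCII characters.

```r
# All printable, non-space ASCII characters (codes 33 to 126), one per element.
chars <- strsplit(rawToChar(as.raw(33:126)), "")[[1]]

# Which single characters does each pattern accept?
by_lookahead     <- grepl("(?![',])[[:punct:]]", chars, perl = TRUE)
by_negated_class <- grepl("[^,'[:^punct:]]", chars, perl = TRUE)

# Both patterns should accept exactly the same characters.
same <- identical(by_lookahead, by_negated_class)
print(same)
print(chars[by_lookahead])
```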

See the R online demo:

X <- "I'm not that good at regex yet, but am getting better!"
strsplit(X, "[[:space:]]|(?=(?![',])[[:punct:]])", perl=TRUE)
[[1]]
 [1] "I'm"     "not"     "that"    "good"    "at"      "regex"   "yet,"   
 [8] "but"     "am"      "getting" "better"  "!"
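
If the set of punctuation marks to keep changes often, the pattern can be assembled from a string. The following is a sketch only: split_keep_punct and its keep argument are hypothetical names, not part of base R or of either answer.

```r
# Hypothetical helper (not from the original answers): split on whitespace, or
# in front of any punctuation mark except the characters listed in `keep`.
split_keep_punct <- function(x, keep = "',") {
  # `keep` is pasted verbatim into a PCRE bracket expression, so it must not
  # contain characters that are special there, such as ], \, ^ or -.
  pattern <- paste0("[[:space:]]|(?=(?![", keep, "])[[:punct:]])")
  strsplit(x, pattern, perl = TRUE)
}

X <- "I'm not that good at regex yet, but am getting better!"
res <- split_keep_punct(X)[[1]]
print(res)
```

With the default keep = "',", the generated pattern is identical to the one used above, so the result is the same twelve tokens.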
