regex - R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

Question

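I'm trying to remove non-alphabetic characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the +. Does this character belong to another group?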

library(stringi)
string1 <- c(
"this is a test"
,"this, is also a test"
,"this is the final. test"
,"this is the final + test!"
)

string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
string1 <- stri_replace_all_regex(string1, '\\+', ' ')

score 19 · Accepted Answer

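POSIX character classes need to be wrapped inside a character class; the correct form is [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.

In the ASCII range, this POSIX named class matches every character that is not a control character, not alphanumeric, and not a space.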

ascii <- rawToChar(as.raw(0:127), multiple=TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

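However, if a locale is in effect, it may alter the behavior of [[:punct:]]...

The R documentation ?regex states the following: Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation is that of the POSIX locale.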

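The Open Group LC_CTYPE definition for punct says:

Define characters to be classified as punctuation characters.

In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.

In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.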

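However, the stringi package seems to depend on ICU, and locale is a fundamental concept in ICU.

With the stringi package, I recommend using the Unicode properties \p{P} and \p{S}.

\p{P} matches any kind of punctuation character. In the ASCII range, however, it is missing nine of the characters that the POSIX class punct includes. This is because Unicode splits what POSIX considers punctuation into two categories, Punctuation and Symbols. This is where \p{S} comes in...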
```
stri_replace_all_regex(string1, '[\\p{P}\\p{S}]', ' ')
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "
```
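As a rough check of the "nine characters" claim, the sketch below splits the printable, non-alphanumeric ASCII characters by Unicode general category using stringi. The variable names (`candidates`, `is_punct`, `is_symbol`) are only illustrative and not part of the original answer; it assumes stringi is installed.

```r
library(stringi)

# Printable ASCII characters (codes 33 to 126), one per element.
ascii <- strsplit(rawToChar(as.raw(33:126)), "")[[1]]

# Keep only characters that are neither ASCII letters nor digits.
candidates <- ascii[!stri_detect_regex(ascii, "[A-Za-z0-9]")]

# Classify each remaining character by Unicode general category.
is_punct  <- stri_detect_regex(candidates, "\\p{P}")
is_symbol <- stri_detect_regex(candidates, "\\p{S}")

# Characters that \p{P} alone would miss.
candidates[is_symbol]
# [1] "$" "+" "<" "=" ">" "^" "`" "|" "~"
```

Each of the 32 candidates falls into exactly one of the two categories, which is why the combined set [\\p{P}\\p{S}] covers the same ASCII characters as POSIX punct.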

Or fall back on gsub from base R, which handles this just fine.

gsub('[[:punct:]]', ' ', string1)
# [1] "this is a test"            "this  is also a test"     
# [3] "this is the final  test"   "this is the final   test "

score 17 · Accepted Answer

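In POSIX-like regex engines, punct stands for the character class corresponding to the ispunct() classification function (see man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except space or a character for which isalnum() is true. In a POSIX setting, however, the details of which characters belong to which class depend on the current locale. So the punct class here does not lead to portable code; see the ICU User Guide section on C/POSIX migration for more details.

On the other hand, the ICU library, on which stringi relies, fully conforms to the Unicode standard and defines some character classes in its own way, which is nevertheless well defined and always portable.

In particular, according to the Unicode standard, PLUS SIGN (U+002B) belongs to the Symbol, Math (Sm) category (and is not a Punctuation Mark (P)).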

library("stringi")
ascii <- stri_enc_fromutf32(1:127)
stri_extract_all_regex(ascii, "[[:punct:]]")[[1]]
##  [1] "!"  "\"" "#"  "%"  "&"  "'"  "("  ")"  "*"  ","  "-"  "."  "/"  ":"  ";"  "?"  "@"  "["  "\\" "]"  "_"  "{"  "}" 
stri_extract_all_regex(ascii, "[[:symbol:]]")[[1]]
## [1] "$" "+" "<" "=" ">" "^" "`" "|" "~"

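So here you should use a character set such as [[:punct:][:symbol:]], [[:punct:]+], or, even better, [\\p{P}\\p{S}] or [\\p{P}+].

For details on the available character classes, see ?"stringi-search-charclass". In particular, the ICU UnicodeSet User Guide and Unicode Standard Annex #44: Unicode Character Database may be of interest to you. HTH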
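A minimal sketch of how these four sets behave on the vector from the question, assuming stringi is installed. The `patterns` vector and its element names are made up here for illustration.

```r
library(stringi)

string1 <- c(
  "this is a test",
  "this, is also a test",
  "this is the final. test",
  "this is the final + test!"
)

# The four character sets suggested above (names are illustrative only).
patterns <- c(
  punct_symbol = "[[:punct:][:symbol:]]",
  punct_plus   = "[[:punct:]+]",
  unicode_p_s  = "[\\p{P}\\p{S}]",
  unicode_plus = "[\\p{P}+]"
)

# Replace every match with a single space, once per pattern.
cleaned <- lapply(patterns, function(p) stri_replace_all_regex(string1, p, " "))

cleaned$unicode_p_s
```

For this particular input all four patterns give the same result, because the only non-alphabetic characters present are `,`, `.`, `!` and `+`. The variants ending in `+` would leave other symbols such as `$` or `=` untouched, which is the reason to prefer the sets that include the whole symbol category.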
