regex - UTF-8-correct regular expression for CamelCase (WikiWord) in Perl

Question

This is a follow-up to a question about a CamelCase regex. In light of tchrist's post, I would like to know what the correct UTF-8 CamelCase regex is.

Starting with (brian d foy's) regex:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word 
/x

and modifying it to:

/
    \b          # start at word boundary
    \p{Uppercase_Letter}     # start with upper
    \p{Alphabetic}*          # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
          |                  # or 
       \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
    )

    \p{Alphabetic}*          # anything that's left
    \b          # end at word 
/x

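The lines marked with "###" are the problem: they still use the ASCII-only class [a-zA-Z].

Additionally, how should the regex be modified if digits and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?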

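Update: (in response to ysth's comment)

For what follows,

any: means "uppercase or lowercase or digit or underscore"

The regex should match CamelWord and CaW, that is, a word that:

starts with an uppercase letter
optional any
a lowercase letter or digit or underscore
optional any
an uppercase letter
optional any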
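
The six steps above can be written down almost literally. The following is only a minimal sketch, not a settled answer: the names $any, $camel_spec and matches_spec are made up for illustration, and it assumes that "digit" means \p{Nd} and that "uppercase" and "lowercase" mean \p{Upper} and \p{Lower}.

```perl
use v5.14;
use warnings;

# "any" as defined above: uppercase, lowercase, digit, or underscore.
# Assumption: "digit" is taken to mean \p{Nd} (any Unicode decimal digit).
my $any = qr/[\p{Upper}\p{Lower}\p{Nd}_]/;

my $camel_spec = qr{
    \b
    \p{Upper}               # starts with an uppercase character
    $any*                   # optional any
    [\p{Lower}\p{Nd}_]      # a lowercase character, digit, or underscore
    $any*                   # optional any
    \p{Upper}               # an uppercase character
    $any*                   # optional any
    \b
}x;

# Hypothetical helper: true when the whole string is one word of this shape.
sub matches_spec {
    my ($word) = @_;
    return $word =~ /\A$camel_spec\z/ ? 1 : 0;
}

for my $word (qw(CamelWord CaW W2X3 Camel ALLCAPS)) {
    printf "%-10s %s\n", $word, matches_spec($word) ? 'match' : 'no match';
}
```

With this reading, CamelWord, CaW and W2X3 are accepted, while Camel (no second uppercase character) and ALLCAPS (no lowercase character, digit, or underscore) are rejected.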

Please do not mark this as a duplicate, because it is not. The original question (and its answers) only considered ASCII.

score 5 · Accepted Answer

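I really can't tell what you're trying to do, but this should be closer to what your original intent seems to be. I still can't tell what you mean to do with it, though.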

m{
    \b
    \p{Upper}      #  start with uppercase code point (NOT LETTER)

    \w*            #  optional ident chars 

    # note that upper and lower are not related to letters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )

    \w*

    \b
}x
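
To see how the pattern above behaves, here is a small test harness. It is a sketch only: the variable $camel, the helper is_camel and the sample words are illustrative choices, and the \A ... \z anchors are added here so that each sample is tested as a whole word.

```perl
use v5.14;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');

# The pattern from the answer, stored as a compiled regex object.
my $camel = qr{
    \b
    \p{Upper}
    \w*
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical helper: true when the whole string is one CamelCase word.
sub is_camel {
    my ($word) = @_;
    return $word =~ /\A$camel\z/ ? 1 : 0;
}

# "\x{C9}" is LATIN CAPITAL LETTER E WITH ACUTE, written as an escape
# so that this source file stays pure ASCII.
my @words = ('CamelCase', 'CaW', "\x{C9}coleNormale", 'Camel', 'ALLCAPS', 'W2X3');

for my $word (@words) {
    printf "%-14s %s\n", $word, is_camel($word) ? 'match' : 'no match';
}
```

CamelCase, CaW and the non-ASCII sample match. Camel and ALLCAPS do not, and neither does W2X3: this pattern requires at least one \p{Lower} character, so it does not treat digits as lowercase.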

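Never use [a-z]. And in fact, don't use \p{Lowercase_Letter} or \p{Ll} either, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.

And remember that \w is really just an alias for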
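
The difference between the general categories (\p{Ll}, \p{Lu}) and the binary properties (\p{Lower}, \p{Upper}) shows up on characters that are cased without being letters. The sketch below checks a few such code points; the sample list and the helper properties_of are chosen here for illustration and are not part of the original answer.

```perl
use v5.14;
use warnings;

# Sample code points picked for illustration: cased characters whose
# general category is not Ll or Lu (they are number or symbol characters).
my @samples = (
    [ 0x2170, 'SMALL ROMAN NUMERAL ONE' ],
    [ 0x24D0, 'CIRCLED LATIN SMALL LETTER A' ],
    [ 0x2160, 'ROMAN NUMERAL ONE' ],
);

# Hypothetical helper: report which of the four properties a character has.
sub properties_of {
    my ($char) = @_;
    return {
        Ll    => ($char =~ /\p{Ll}/    ? 1 : 0),
        Lower => ($char =~ /\p{Lower}/ ? 1 : 0),
        Lu    => ($char =~ /\p{Lu}/    ? 1 : 0),
        Upper => ($char =~ /\p{Upper}/ ? 1 : 0),
    };
}

for my $sample (@samples) {
    my ($code, $name) = @$sample;
    my $p = properties_of(chr $code);
    printf "U+%04X %-30s Ll=%d Lower=%d Lu=%d Upper=%d\n",
        $code, $name, @{$p}{qw(Ll Lower Lu Upper)};
}
```

The small Roman numeral and the circled small letter are \p{Lower} but not \p{Ll}, and the capital Roman numeral is \p{Upper} but not \p{Lu}, which is why the pattern's comment says "code point (NOT LETTER)".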

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
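
One way to get a feel for this is to compare \w against the bracketed class on a handful of characters. This is a spot check on samples chosen here for illustration, not a proof that the two are identical for every code point or every Perl version; the names $class, @samples and agree are hypothetical.

```perl
use v5.14;
use warnings;

# The character class quoted above.
my $class = qr/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]/;

# Illustrative samples: a letter, an accented letter, a combining mark,
# an Arabic-Indic digit, a Roman numeral, two connector punctuation
# characters, and three characters that are not word characters.
my @samples = (
    'a', "\x{E9}", "\x{301}", "\x{663}", "\x{2160}",
    '_', "\x{203F}", '-', ' ', '!',
);

# Hypothetical helper: true when \w and the explicit class give the same
# verdict. The /u flag forces Unicode rules for \w on every string.
sub agree {
    my ($char) = @_;
    my $by_w     = $char =~ /\A\w\z/u     ? 1 : 0;
    my $by_class = $char =~ /\A$class\z/  ? 1 : 0;
    return $by_w == $by_class ? 1 : 0;
}

for my $char (@samples) {
    printf "U+%04X %s\n", ord $char, agree($char) ? 'same' : 'DIFFERENT';
}
```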
