php - Using accented letters in a regular expression for string mutation

Question

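How can I modify the regex code for string mutation so that it also works with accented letters? For example, the string mutation of "amor" produced by the regex should be the same as that of "āmōr". I tried simply including the accented letters, e.g. (?<=[aeiouāēīōūăĕĭŏŭ]), but that did not work.

My code: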

$hyphenation = '~
(?<=[aeiou]) # each syllable contains a vowel
(?:
    # Muta cum liquida
    ( (?:[bcdfgpt]r | [bcfgp] l | ph [lr] | [cpt] h | qu ) [aeiou] x )
  |
    [bcdfghlmnp-tx]
    (?:
        # ct goes together

        [cp] \K (?=t)
      |
        # two or more consonants are split up
        \K (?= [bcdfghlmnp-tx]+ [aeiou]) 
    )   
  |
    # a consonant and a vowel go together
    (?:
        \K (?= [bcdfghlmnp-t] [aeiou])
      | 
        #  "x" goes to the preceding vowel
        x \K (?= [a-z] | (*SKIP)(*F) ) 
    )
  |
    # two vowels are split up except ae oe...
    \K (?= [aeiou] (?<! ae | oe | au | que | qua | quo | qui ) ) 
)
~xi';


// hyphenation
$result = preg_replace($hyphenation, '-$1', $input);

score 0 · Accepted Answer

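In Unicode, an accented letter can be represented in more than one way. For example, ā can be the single code point U+0101 (LATIN SMALL LETTER A WITH MACRON), but it can also be the sequence U+0061 (LATIN SMALL LETTER A) followed by U+0304 (COMBINING MACRON).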
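To make the two representations concrete, here is a minimal sketch that builds both forms of ā and compares their bytes. It assumes PHP 7 or later (for the \u{...} escape), and the variable names are illustrative only:

```php
<?php
// Two ways to write "ā"; variable names are illustrative only.
$precomposed = "\u{0101}";   // single code point: a with macron
$decomposed  = "a\u{0304}";  // letter a followed by a combining macron

// The strings render alike but are different byte sequences in UTF-8.
var_dump($precomposed === $decomposed); // bool(false)
echo bin2hex($precomposed), "\n";       // c481
echo bin2hex($decomposed), "\n";        // 61cc84
```

A character class that contains only the precomposed letter therefore cannot match the decomposed spelling, and vice versa.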

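Therefore, writing (?<=[aeiouāēīōūăĕĭŏŭ]) is correct, provided that:

You use the u modifier to tell the PCRE regex engine that both the pattern and the subject string must be read as UTF-8. Otherwise each multi-byte character is treated as a sequence of separate bytes rather than as one atomic character, which causes problems and odd results, especially when the multi-byte character sits inside a character class. For example, without u, [eā]+ finds a match in "ē", because ā (0xC4 0x81) and ē (0xC4 0x93) share the lead byte 0xC4.
You make sure the subject string and the pattern use the same form for each letter. If the pattern uses U+0101 while the string uses U+0061 followed by U+0304 for "ā", it will not match. To prevent this, apply $str = Normalizer::normalize($str); to the subject string, which by default converts it to the composed form (NFC). This method comes from the intl extension.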
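The two conditions can be tried out with a short sketch. It is not the full pattern from the question: the rule below is a deliberately simplified stand-in (split between a vowel and a following consonant plus vowel), and hyphenateSketch is a hypothetical helper name. It also assumes the source file is saved as UTF-8 and that PHP 7 or later is used.

```php
<?php
// Assumes this source file is saved as UTF-8.

// 1) Effect of the u modifier on a character class.
$withoutU = preg_match('~[eā]+~', 'ē');   // 1: the shared lead byte 0xC4 matches
$withU    = preg_match('~[eā]+~u', 'ē');  // 0: ē is not a member of the class

// 2) Simplified stand-in rule (not the question's full pattern):
//    split between a vowel and a following consonant + vowel.
$vowels  = 'aeiouāēīōūăĕĭŏŭ';
$pattern = '~(?<=[' . $vowels . '])(?=[bcdfghlmnp-t][' . $vowels . '])~ui';

// Hypothetical helper: normalize to NFC, then apply the pattern.
function hyphenateSketch(string $input, string $pattern): string
{
    // Normalizer needs the intl extension; skip normalization if it is missing.
    if (class_exists('Normalizer')) {
        $nfc = Normalizer::normalize($input, Normalizer::FORM_C);
        if ($nfc !== false) {
            $input = $nfc;
        }
    }
    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '-', $input) ?? $input;
}

echo hyphenateSketch('amor', $pattern), "\n";                  // a-mor
echo hyphenateSketch("a\u{0304}mo\u{0304}r", $pattern), "\n";  // ā-mōr when intl is available
```

For the full pattern from the question, the same two changes would apply: extend every [aeiou] class with the accented vowels and change the trailing modifiers from ~xi to ~xiu.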

You can find more information at the following links:

https://en.wikipedia.org/wiki/Unicode_equivalence
http://utf8-chartable.de/
http://php.net/manual/en/normalizer.normalize.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://pcre.org/original/pcre.txt
