php - Word boundaries in PHP

Question

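In PHP, a diacritic before or after a letter counts as a word boundary (\b), which is not the behavior I want. Is this normal in other programming languages? (I know most languages have problems with \b and \w.) How can I work around it efficiently?

From a Unicode point of view, which Unicode categories make up a word boundary?

Here is an example: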

<?php
preg_match_all('#\bج\b#u', 'مَجْل', $t); // the font of this site does not display diacritics
var_dump($t);

score 1 · Accepted Answer

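In PCRE (in Unicode mode):

\d any character that \p{Nd} matches (a decimal digit)

\s any character that \p{Z} matches, plus HT, LF, FF, CR

\w any character that \p{L} or \p{N} matches, plus underscore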

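From the definition of \w you can infer the definition of \b in Unicode mode: a boundary falls wherever a \w character is next to a non-\w character. Combining marks belong to \p{M}, which is not part of the \w definition above, so a boundary is reported between a base letter and its diacritic. As a result, even a string such as Åström written with decomposed characters, which logically has only two word boundaries, yields several more (each * marks a detected boundary): *A*̊*stro*̈*m*.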
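One way to sidestep \b is to spell the boundary out with lookarounds that also treat \p{M} as part of a word. The following is only a minimal sketch under that assumption; the helper name containsWholeWord and the choice of "word" characters (\p{L}, \p{M}, \p{N}, underscore) are mine, not something PCRE or PHP provides.

```php
<?php
// Hypothetical helper (not part of PHP): true if $needle occurs in $haystack
// as a whole word, where combining marks count as word characters.
function containsWholeWord(string $needle, string $haystack): bool
{
    // Assumed definition of a "word" character: letter, mark, number, underscore.
    $wordChar = '[\p{L}\p{M}\p{N}_]';

    // Negative lookbehind/lookahead replace \b on each side of the literal.
    $pattern = '#(?<!' . $wordChar . ')' . preg_quote($needle, '#') . '(?!' . $wordChar . ')#u';

    return preg_match($pattern, $haystack) === 1;
}

// The word from the question, written with escapes so the diacritics survive:
// meem, fatha, jeem, sukun, lam.
$vocalized = "\u{0645}\u{064E}\u{062C}\u{0652}\u{0644}";
$jeem      = "\u{062C}";

var_dump(containsWholeWord($jeem, $vocalized));                      // bool(false)
var_dump(containsWholeWord($jeem, "\u{0645} \u{062C} \u{0644}"));    // bool(true)

// Decomposed "Åström": A + combining ring, o + combining diaeresis.
$astrom = "A\u{030A}stro\u{0308}m";
var_dump(containsWholeWord('A', $astrom));                           // bool(false)
var_dump(containsWholeWord($astrom, "Hej $astrom!"));                // bool(true)
```

Because the lookarounds name the character classes explicitly, the result does not depend on how a given PCRE build defines \w.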

score 0 · Accepted Answer

This is only a workaround:

preg_match_all('#(\p{M}*\p{Arabic}*\p{M}*)*ج(\p{M}*\p{Arabic}*\p{M}*)*#u','مَجْل جميل testجواد',$t); // the font of this site does not display diacritics
print_r(array_filter(array_map('array_filter', $t)));

Output:

Array
(
    [0] => Array
        (
            [0] => مَجْل
            [1] => جميل
            [2] => جواد
        )

)

I found that \p{M} matches the tashkil (the Arabic diacritics) and that \p{Arabic} matches an Arabic letter.
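To see the two properties in isolation, here is a small sketch that counts and strips the marks from the word in the question. The variable names are illustrative, and the string is written with \u{...} escapes on the assumption that the diacritics might otherwise get lost in copy/paste.

```php
<?php
// "majl" with diacritics: meem, fatha (mark), jeem, sukun (mark), lam.
$vocalized = "\u{0645}\u{064E}\u{062C}\u{0652}\u{0644}";

// \p{M} matches each combining mark (the tashkil).
$markCount = preg_match_all('#\p{M}#u', $vocalized);

// Removing every \p{M} leaves only the base letters.
$bare = preg_replace('#\p{M}+#u', '', $vocalized);

// The remaining characters are all Arabic-script letters.
$allArabic = preg_match('#^\p{Arabic}+$#u', $bare) === 1;

var_dump($markCount);                            // int(2)
var_dump($bare === "\u{0645}\u{062C}\u{0644}");  // bool(true)
var_dump($allArabic);                            // bool(true)
```

Stripping the marks before matching is another possible workaround when the diacritics themselves are not needed in the result.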
