php - 正则表达式中的希伯来语特殊字符

Question

这是我的代码：

preg_replace('/[^{Hebrew}a-zA-Z0-9_ %\[\]\.\(\)%&-]/s', '', $q);

它应该只接受 az、AZ、0-9、任意数量的单个空格和希伯来语字符。

我尝试了很多变化，但无法让它工作。

提前致谢！

score 5 · Accepted Answer

In PCRE, \p{xx} and \P{xx} can take in either a Unicode category name or Unicode script name. The list can be found in PHP documentation or in PCRE man page.

For Hebrew script, you need to use \p{Hebrew}.

I also remove escape \ for ., (, ), since they already loses their special meaning inside the character class []. The s flag (DOTALL) is useless, since there is no dot metacharacter in your regex.

preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %\[\].()&-]/', '', $q);

Appendix

From Unicode FAQs. It explains the difference between blocks and scripts. For your information, PCRE only has support for matching Unicode scripts and Unicode categories (character properties).

Q: If Unicode blocks aren't code pages, what are they?

A: Blocks in the Unicode Standard are named ranges of code points. They are used to help organize the standard into groupings of related kinds of characters, for convenience in reference. And they are used by a charting program to define the ranges of characters printed out together for the code charts seen in the book or posted online.

Q: Do Unicode blocks have defined character properties?

A: No. The character properties are associated with encoded characters themselves, rather than the blocks they are encoded in.

Q: Does that even apply to the script for characters?

A: Yes. For example, the Thai block contains Thai characters that have the Thai script property, but it also contains the character for the baht currency sign, which is used in Thai text, of course, but which is defined to have the Common script property. To find the script property value for any character you need to rely on the Unicode Character Database data file, Scripts.txt, rather than the block value alone.

Q: So block value is not the same as script value?

A: Correct. In some cases, such as Latin, the encoded characters are spread across as many as a dozen different Unicode blocks. That is unfortunate, but is simply the result of the history of the standard. In other instances, a single block may contain characters of more than one script. For example, the Greek and Coptic block contains mostly characters of the Greek script, but also a few historic characters of the Coptic script.

score 0 · Accepted Answer

您应该将文件更改为 utf 8 编码，例如：notepad++ 转到编码 -> 编码为 UTF-8。它可以工作：preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %[].()&-]/u','', $q)我还添加了“u”作为修饰符。

php - 正则表达式中的希伯来语特殊字符

2 回答 2

Appendix

Related

Reference