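How do I write a regular expression that matches all valid Spanish and Arabic words?
I know that for English it is a-zA-Z, for Hebrew it is א-ת, and for Russian it is А-Яа-яёЁ.
I'm using JavaScript.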

1 Answer

The range a-zA-Z for English words is unacceptably simple and naïve. It leaves out all manner of letters with accents and other special marks that are used in loan words, etc. For instance, it won't match the word "naïve", from my first sentence. Use the \p{Latin} script, instead.
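To make the "naïve" point concrete, here is a minimal sketch. It assumes a JavaScript engine that supports ES2018 Unicode property escapes, where the script is written as `\p{Script=Latin}` and the `u` flag is required. The variable names are only illustrative.

```javascript
// ASCII-only range versus the Latin script property (the `u` flag is required).
const asciiWord = /^[a-zA-Z]+$/;
const latinWord = /^\p{Script=Latin}+$/u;

// Normalize to NFC so that "ï" is the single code point U+00EF
// rather than "i" followed by a combining diaeresis.
const sample = "na\u00EFve".normalize("NFC");

const asciiMatches = asciiWord.test(sample); // false: "ï" is outside a-z and A-Z
const latinMatches = latinWord.test(sample); // true: "ï" belongs to the Latin script

console.log(asciiMatches, latinMatches);
```

The normalization step matters because a combining mark on its own is not assigned to the Latin script, so a decomposed "ï" would not match `latinWord`.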

The range א-ת for Hebrew words is also wrong. It leaves out Hebrew presentation forms, cantillation marks, Yiddish digraphs, and more. Use the \p{Hebrew} script, instead.

The range А-Яа-яёЁ for Russian is again incomplete and wrong. Use the \p{Cyrillic} script, instead.

The Spanish alphabet uses the same 26 letters as English, plus ñÑ. But again, don't hardcode these into a range. Many Spanish words use accented vowels. Use the \p{Latin} script to match Spanish words. Regexes won't help you distinguish Spanish from English.

For Arabic, use the \p{Arabic} script.
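As a rough sketch of how the two scripts can be used together, the helper below pulls out runs of characters belonging to one script. It assumes native ES2018 property escapes (`\p{Script=...}` with the `u` flag) rather than a library, and `findWords` is a hypothetical name of my own. It also treats a "word" as nothing more than a run of same-script characters, so Arabic vowel marks, which Unicode assigns to the Inherited script, would need extra handling.

```javascript
// Hypothetical helper: return every run of characters from one Unicode script.
// `script` is assumed to be a valid Script value such as "Latin" or "Arabic".
function findWords(text, script) {
  const re = new RegExp(`\\p{Script=${script}}+`, "gu");
  // NFC keeps accented letters such as "ñ" as single precomposed code points.
  return text.normalize("NFC").match(re) || [];
}

// "\u0645\u0631\u062D\u0628\u0627" is the Arabic word for "hello", written
// with escapes so the source stays readable in left-to-right editors.
const text = "El ni\u00F1o dijo \u0645\u0631\u062D\u0628\u0627 al se\u00F1or";

const latinWords = findWords(text, "Latin");   // ["El", "niño", "dijo", "al", "señor"]
const arabicWords = findWords(text, "Arabic"); // one match: the Arabic word

console.log(latinWords, arabicWords);
```

Note that the Latin result cannot tell you which words are Spanish: an English sentence would match in exactly the same way.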

JavaScript, regex, and Unicode

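You said you're using JavaScript. Unfortunately, JavaScript has historically had very little built-in support for Unicode. In engines that predate ES2018, you need to use the XRegExp library and its Unicode addon, which lets you use all of the Unicode scripts I mentioned above in your regular expressions. Engines that implement ES2018 accept the same script properties natively when the regex carries the u flag, but the script has to be written as \p{Script=Latin} rather than \p{Latin}.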

Scripts vs blocks

Always favor Unicode scripts over Unicode blocks. Blocks match up poorly with the code points in a particular script. Blocks very often leave out many important code points that fall outside of their incomplete range, and include many code points that have not been assigned any character. Scripts include all relevant code points, and no more.
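Here is a small sketch of the mismatch. JavaScript has no block property, so the Arabic block is approximated by hand as the range U+0600 to U+06FF; that range and the variable names are my own stand-ins. The script test assumes native ES2018 property escapes.

```javascript
// Hand-written stand-in for the "Arabic" block: code points U+0600..U+06FF.
const inArabicBlock = /^[\u0600-\u06FF]$/u;
// The Arabic script as defined by the Unicode Script property.
const inArabicScript = /^\p{Script=Arabic}$/u;

// U+060C ARABIC COMMA sits inside the block, but its script is Common.
const arabicComma = "\u060C";
// U+FB50 ARABIC LETTER ALEF WASLA ISOLATED FORM is Arabic script,
// but it lives in a separate presentation-forms block.
const alefWasla = "\uFB50";

const results = {
  commaInBlock: inArabicBlock.test(arabicComma),   // true
  commaInScript: inArabicScript.test(arabicComma), // false
  alefInBlock: inArabicBlock.test(alefWasla),      // false
  alefInScript: inArabicScript.test(alefWasla),    // true
};

console.log(results);
```

The block range both misses a genuine Arabic letter and admits a punctuation mark that is shared across scripts, which is the kind of mismatch described above.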

Answered 2012-06-04T14:46:26.613