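As this question goes beyond Perl, I was interested to find out how this behaves in general. I tested it on popular programming languages with native regular expression support (Perl, PHP, Python, Ruby, Java and Javascript), and the conclusions are:
- `[a-z]` always matches only the ASCII-7 a-z range in every one of those languages, and locale settings do not affect it in any way. Characters like `ž` and `š` are never matched.
- `\w` might or might not match the characters `ž` and `š`, depending on the programming language and on the parameters given when the regular expression is created. This expression shows the greatest variety: in some languages they are never matched regardless of options, in others they are always matched, and in some it depends.
- POSIX `[[:alpha:]]` and Unicode `\p{Alpha}` and `\p{L}` will match characters like `ž` and `š`, provided the regular expression system of the language in question supports them and an appropriate configuration is used.
Note that "appropriate configuration" did not require a locale change: changing the locale had no effect on the results in any of the tested systems.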
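As a rough illustration of the first two conclusions, here is a minimal Python 3 sketch. Python is picked only as one of the tested languages, and the helper name `classify` is made up for this example. It checks the three test characters against `[a-z]` and `\w` with default options.

```python
import re

def classify(ch):
    """Report which of the two patterns fully match the single character ch."""
    return {
        "[a-z]": re.fullmatch(r"[a-z]", ch) is not None,
        # For str patterns in Python 3, \w is Unicode-aware by default.
        r"\w": re.fullmatch(r"\w", ch) is not None,
    }

for ch in ("v", "š", "ž"):
    # Print the code point instead of the character to stay terminal-safe.
    print("U+%04X" % ord(ch), classify(ch))
```

In other languages the `\w` column can come out differently, which is exactly the variety described above.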
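To be on the safe side, I also tested command-line Perl, grep and awk. Command-line Perl behaves identically to the above. However, the versions of grep and awk I have seem to behave differently from the others: for them, locale matters for `[a-z]` as well. The behaviour is also version- and implementation-specific, and recent versions of these tools do not exhibit it.
In that context (grep, awk or similar command-line tools) I would agree that using the `a-z` range without defining the locale could be considered a bug, as you can't really know what you will end up with.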
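If we go into more detail per language, the status seems to be:
Java
In Java, `\p{Alpha}` works like `[a-z]` if the Unicode class flag is not specified, and like a Unicode alphabetic character class if it is, in which case it matches `ž`. `\w` matches characters like `ž` if the Unicode flag is present and does not if it is absent, and `\p{L}` matches regardless of the Unicode flag. There are no locale-aware regular expressions, and no support for `[[:alpha:]]`.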
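PHP
In PHP, `\w`, `[[:alpha:]]` and `\p{L}` match characters like `ž` if the Unicode switch (`/u`) is present, and do not if it is absent. `\p{Alpha}` is not supported. Locale has no effect on regular expressions.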
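One way to picture why the Unicode switch matters is the difference between matching on raw bytes and matching on decoded characters. The sketch below is Python, not PHP, and is only an analogy under that assumption: it shows that the UTF-8 form of `ž` is two bytes, so a byte-oriented `\w` cannot cover it, while a character-oriented `\w` can.

```python
import re

raw = "ž".encode("utf-8")  # two bytes: 0xC5 0xBE

# Byte-oriented matching: the pattern sees two separate bytes, neither of
# which is an ASCII word character, so a single \w cannot match the input.
bytes_match = re.fullmatch(rb"\w", raw) is not None

# Character-oriented matching: the decoded text is a single code point.
text_match = re.fullmatch(r"\w", raw.decode("utf-8")) is not None

print(len(raw), bytes_match, text_match)
```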
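Python
`\w` matches the mentioned characters if the Unicode flag is present and the locale flag is not. For Unicode strings, the Unicode flag is assumed by default in Python 3, but not in Python 2. Unicode `\p{Alpha}` and `\p{L}` and POSIX `[[:alpha:]]` are not supported in Python.
The modifier for locale-specific regular expressions apparently only works for character sets with 1 byte per character, which makes it unusable for Unicode.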
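The points above can be checked with a short sketch, assuming Python 3.6 or later (where unknown escapes such as `\p` are rejected by the standard `re` module). The `is_letter` helper is a hypothetical stand-in for `\p{L}` built on `unicodedata`; it is not something the `re` module provides.

```python
import re
import unicodedata
import warnings

ch = "ž"

# \w on str patterns: Unicode-aware by default, ASCII-only with re.ASCII.
unicode_w = re.fullmatch(r"\w", ch) is not None
ascii_w = re.fullmatch(r"\w", ch, re.ASCII) is not None

# \p{L} is not part of the re syntax, so compiling it fails.
try:
    re.compile(r"\p{L}")
    p_l_supported = True
except re.error:
    p_l_supported = False

# [[:alpha:]] is parsed as an ordinary set followed by a literal "]",
# so it does not behave like the POSIX class.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the "possible nested set" warning
    posix_like = re.fullmatch(r"[[:alpha:]]", "a") is not None

def is_letter(c):
    """Hypothetical stand-in for \\p{L}: general category starts with 'L'."""
    return unicodedata.category(c).startswith("L")

print(unicode_w, ascii_w, p_l_supported, posix_like, is_letter(ch))
```

The third-party `regex` module mentioned in the test code comments is the usual route when real `\p{...}` support is needed.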
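Perl
`\w` matches the previously mentioned characters in addition to matching `[a-z]`. Unicode `\p{Letter}` and `\p{Alpha}` and POSIX `[[:alpha:]]` are supported and work as expected. The Unicode and locale flags for the regular expression did not change the results, and neither did changing the locale or `use locale;` / `no locale;`.
The behaviour does not change if we run the tests using command-line Perl.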
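Ruby
`[a-z]` and `\w` detect just the characters `[a-z]`, regardless of options. Unicode `\p{Letter}` and `\p{Alpha}` and POSIX `[[:alpha:]]` are supported and work as expected. Locale has no effect.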
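Javascript
`[a-z]` and `\w` always detect just the characters `[a-z]`. ECMAScript 2015 added the `/u` Unicode switch, which most major browsers support, but in the versions tested here it did not bring support for `[[:alpha:]]`, `\p{Alpha}` or `\p{L}`, nor did it change the behaviour of `\w` (Unicode property escapes such as `\p{L}` were only added later, in ECMAScript 2018). The Unicode switch does make the engine treat a Unicode character as one character, which had been a problem before.
The situation is the same for client-side Javascript and for Node.js.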
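The remark about treating a Unicode character as one character relates to Javascript strings being sequences of UTF-16 code units: a character outside the Basic Multilingual Plane takes two units, which a pattern without `/u` sees as two separate items. The following Python sketch (Python is used only for illustration, and `utf16_units` is a made-up helper) counts the code units for one of the test characters and for an emoji.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

print(utf16_units("\u017e"))      # ž (U+017E) fits in one code unit
print(utf16_units("\U0001F600"))  # U+1F600 needs a surrogate pair: two units
```

The test characters `š` and `ž` each fit in a single code unit, so this particular problem does not affect them.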
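AWK
For AWK, there is a longer description of the status in the article A.8 Regexp Ranges and Locales: A Long Sad Story. It explains that in the old world of Unix tools, `[a-z]` was the correct way to detect lowercase letters, and that is how the tools of the time worked. However, POSIX 1992 introduced locales and changed the interpretation of character classes so that the order of characters was defined by collation order, binding it to the locale. This was adopted by the AWK of the time (the 3.x series) as well, which resulted in several issues. By the time the 4.x series was developed, POSIX 2008 had declared the order undefined, and the maintainer reverted to the original behaviour.
Nowadays mostly the 4.x version of AWK is used. With it, `[a-z]` matches a-z and ignores any locale changes, while `\w` and `[[:alpha:]]` match locale-specific characters. Unicode `\p{Alpha}` and `\p{L}` are not supported.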
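grep
Grep (as well as sed and ed) uses GNU Basic Regular Expressions, which is an old flavour. It has no support for Unicode character classes.
At least GNU grep 2.16 and 2.25 seem to follow POSIX 1992 in that locale matters for `[a-z]`, as well as for `\w` and `[[:alpha:]]`. This means, for example, that `[a-z]` only matches z in the set xuzvöä if the Estonian locale is used. This behaviour does not affect older or newer versions of GNU grep, though I'm unsure in exactly which versions it changed.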
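The Estonian result makes sense once the range is read in collation order rather than code point order. In the Estonian alphabet z and ž come right after š and before t, so everything from t onwards sorts after z. The sketch below models this in Python with a hand-written, simplified lowercase alphabet; the `ESTONIAN_ORDER` string and both helper functions are assumptions for illustration and do not use real locale data.

```python
# Simplified lowercase Estonian alphabet in dictionary order (assumption:
# real locale collation data is more detailed than this list).
ESTONIAN_ORDER = "abcdefghijklmnopqrsšzžtuvwõäöüxy"

def in_collation_range(ch, lo="a", hi="z", order=ESTONIAN_ORDER):
    """True if ch sorts between lo and hi in the given collation order."""
    pos = order.find(ch)
    return pos != -1 and order.index(lo) <= pos <= order.index(hi)

def in_codepoint_range(ch, lo="a", hi="z"):
    """True if ch lies between lo and hi by code point."""
    return lo <= ch <= hi

sample = "xuzvöä"
print([c for c in sample if in_collation_range(c)])  # ['z']
print([c for c in sample if in_codepoint_range(c)])  # ['x', 'u', 'z', 'v']
```

Under this model `š` falls inside the collation-based range even though it is outside ASCII, which is the kind of surprise that makes locale-dependent ranges hard to rely on.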
The test code used for each language is listed below.
Java (1.8.0_131)
import java.util.regex.*;
import java.util.Locale;
public class RegExpTest {
public static void main(String args[]) {
verify("v", 118);
verify("š", 353);
verify("ž", 382);
tryWith("v");
tryWith("š");
tryWith("ž");
}
static void tryWith(String input) {
matchWith("[a-z]", input);
matchWith("\\w", input);
matchWith("\\p{Alpha}", input);
matchWith("\\p{L}", input);
matchWith("[[:alpha:]]", input);
}
static void matchWith(String pattern, String input) {
printResult(Pattern.compile(pattern), input);
printResult(Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS), input);
}
static void printResult(Pattern pattern, String input) {
System.out.printf("%s\t%03d\t%5s\t%-10s\t%-10s\t%-5s%n",
input, input.codePointAt(0), Locale.getDefault(),
specialFlag(pattern.flags()),
pattern, pattern.matcher(input).matches());
}
static String specialFlag(int flags) {
if ((flags & Pattern.UNICODE_CHARACTER_CLASS) == Pattern.UNICODE_CHARACTER_CLASS) {
return "UNICODE_FLAG";
}
return "";
}
static void verify(String str, int code) {
if (str.codePointAt(0) != code) {
throw new RuntimeException("your editor is not properly configured for this character: " + str);
}
}
}
PHP (7.1.5)
<?php
/*
PHP, even with 7, only has binary strings that can be operated with unicode-aware
functions, if needed. So functions operating them need to be told which charset to use.
When there is encoding assumed and not specified, PHP defaults to ISO-8859-1.
*/
// PHP7 and extension=php_intl.dll enabled in PHP.ini is needed for IntlChar class
function codepoint($char) {
return IntlChar::ord($char);
}
function verify($inputp, $code) {
if (codepoint($inputp) != $code) {
throw new Exception(sprintf('Your editor is not configured correctly for %s (result %s, should be %s)',
$inputp, codepoint($inputp), $code));
}
}
$rowindex = 0;
$origlocale = getlocale();
verify('v', 118);
verify('š', 353); // https://en.wikipedia.org/wiki/%C5%A0#Computing_code
verify('ž', 382); // https://en.wikipedia.org/wiki/%C5%BD#Computing_code
function tryWith($input) {
matchWith('[a-z]', $input);
matchWith('\\w', $input);
matchWith('[[:alpha:]]', $input); // POSIX, http://www.regular-expressions.info/posixbrackets.html
matchWith('\p{L}', $input);
}
function matchWith($pattern, $input) {
global $origlocale;
selectLocale($origlocale);
printResult("/^$pattern\$/", $input);
printResult("/^$pattern\$/u", $input);
selectLocale('C'); # default (root) locale
printResult("/^$pattern\$/", $input);
printResult("/^$pattern\$/u", $input);
selectLocale(['et_EE', 'et_EE.UTF-8', 'Estonian_Estonia.1257']);
printResult("/^$pattern\$/", $input);
printResult("/^$pattern\$/u", $input);
selectLocale($origlocale);
}
function selectLocale($locale) {
if (!is_array($locale)) {
$locale = [$locale];
}
// On Windows, no UTF-8 locale can be set
// https://stackoverflow.com/a/16120506/365237
// https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
// Available Windows locales
// https://docs.moodle.org/dev/Table_of_locales
$retval = setlocale(LC_ALL, $locale);
//printf("setting locale %s, retval was %s\n", join(',', $locale), $retval);
if ($retval === false || $retval === null) {
throw new Exception(sprintf('Setting locale %s failed', join(',', $locale)));
}
}
function getlocale() {
return setlocale(LC_ALL, 0);
}
function printResult($pattern, $input) {
global $rowindex;
printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n",
$rowindex, $input, codepoint($input), getlocale(),
specialFlag($pattern),
$pattern, (preg_match($pattern, $input) === 1)?'true':'false');
$rowindex = $rowindex + 1;
}
function specialFlag($pattern) {
$arr = explode('/',$pattern);
$lastelem = array_pop($arr);
if (strpos($lastelem, 'u') !== false) {
return 'UNICODE';
}
return '';
}
tryWith('v');
tryWith('š');
tryWith('ž');
Python (3.5.3)
# -*- coding: utf-8 -*-
# with python, there are two strings: unicode strings and regular ones.
# when you use unicode strings, regular expressions also take advantage of it,
# so no need to tell that separately. However, if you want to be using specific
# locale, that you need to tell.
# Note that python3 regexps defaults to unicode mode if unicode regexp string is used,
# python2 does not. Also strings are unicode strings in python3 by default.
# summary: [a-z] is always [a-z], \w will match if unicode flag is present and
# locale flag is not present, no unicode \p{Letter} or POSIX :alpha: exists.
# Letters outside ascii-7 never match \w if locale-specific
# regexp is used, as it only supports charsets with one byte per character
# (https://lists.gt.net/python/python/850772).
# Note that in addition to standard https://docs.python.org/3/library/re.html, more
# complete https://pypi.python.org/pypi/regex/ third-party regexp library exists.
import re, locale
def verify(inputp, code):
    if (ord(inputp[0]) != code):
        raise Exception('Your editor is not configured correctly for %s (result %s)' % (inputp, ord(inputp[0])))
    return

rowindex = 0
origlocale = locale.getlocale(locale.LC_ALL)
verify(u'v', 118)
verify(u'š', 353)
verify(u'ž', 382)

def tryWith(input):
    matchWith(u'[a-z]', input)
    matchWith(u'\\w', input)

def matchWith(pattern, input):
    global origlocale
    locale.setlocale(locale.LC_ALL, origlocale)
    printResult(re.compile(pattern), input)
    printResult(re.compile(pattern, re.UNICODE), input)
    # note: Python 3.6+ only accepts re.LOCALE with bytes patterns
    printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)
    matchWith2(pattern, input, 'C') # default (root) locale
    matchWith2(pattern, input, 'et_EE')
    matchWith2(pattern, input, 'et_EE.UTF-8')
    matchWith2(pattern, input, 'Estonian_Estonia.1257') # Windows locale
    locale.setlocale(locale.LC_ALL, origlocale)

def matchWith2(pattern, input, localeParam):
    try:
        locale.setlocale(locale.LC_ALL, localeParam) # switch to the locale under test
        printResult(re.compile(pattern), input)
        printResult(re.compile(pattern, re.UNICODE), input)
        printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)
    except locale.Error:
        print("Locale %s not supported on this platform" % localeParam)

def printResult(pattern, input):
    global rowindex
    try:
        print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
            (rowindex, input, ord(input[0]), locale.getlocale(), \
            specialFlag(pattern.flags), \
            pattern.pattern, pattern.match(input) is not None))
    except UnicodeEncodeError:
        # the console cannot display the character, print a placeholder instead
        print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
            (rowindex, '?', ord(input[0]), locale.getlocale(), \
            specialFlag(pattern.flags), \
            pattern.pattern, pattern.match(input) is not None))
    rowindex = rowindex + 1

def specialFlag(flags):
    ret = []
    if ((flags & re.UNICODE) == re.UNICODE):
        ret.append("UNICODE_FLAG")
    if ((flags & re.LOCALE) == re.LOCALE):
        ret.append("LOCALE_FLAG")
    return ','.join(ret)

tryWith(u'v')
tryWith(u'š')
tryWith(u'ž')
Perl (v5.22.3)
# Summary: [a-z] is always [a-z], \w always seems to recognize given test chars and
# unicode \p{Letter}, \p{Alpha} and POSIX :alpha: are supported.
# Unicode and locale flags for regular expression didn't matter in this use case.
use warnings;
use strict;
use utf8;
use v5.14;
use POSIX qw(locale_h);
use Encode;
binmode STDOUT, "utf8";
sub codepoint {
my $inputp = $_[0];
return unpack('U*', $inputp);
}
sub verify {
my($inputp, $code) = @_;
if (codepoint($inputp) != $code) {
die sprintf('Your editor is not configured correctly for %s (result %s)', $inputp, codepoint($inputp))
}
}
sub getlocale {
return setlocale(LC_ALL);
}
my $rowindex = 0;
my $origlocale = getlocale();
verify('v', 118);
verify('š', 353);
verify('ž', 382);
# printf('orig locale is %s', $origlocale);
sub tryWith {
my ($input) = @_;
matchWith('[a-z]', $input);
matchWith('\w', $input);
matchWith('[[:alpha:]]', $input);
matchWith('\p{Alpha}', $input);
matchWith('\p{L}', $input);
}
sub matchWith {
my ($pattern, $input) = @_;
my @locales_to_test = ($origlocale, 'C','C.UTF-8', 'et_EE.UTF-8', 'Estonian_Estonia.UTF-8');
for my $testlocale (@locales_to_test) {
use locale;
# printf("Testlocale %s\n", $testlocale);
setlocale(LC_ALL, $testlocale);
printResult($pattern, $input, '');
printResult($pattern, $input, 'u');
printResult($pattern, $input, 'l');
printResult($pattern, $input, 'a');
};
no locale;
setlocale(LC_ALL, $origlocale);
printResult($pattern, $input, '');
printResult($pattern, $input, 'u');
printResult($pattern, $input, 'l');
printResult($pattern, $input, 'a');
}
sub printResult{
no warnings 'locale';
# for this test, as we want to be able to test non-unicode-compliant locales as well
# remove this for real usage
my ($pattern, $input, $flags) = @_;
my $regexp = qr/$pattern/;
$regexp = qr/$pattern/u if ($flags eq 'u');
$regexp = qr/$pattern/l if ($flags eq 'l');
$regexp = qr/$pattern/a if ($flags eq 'a'); # ASCII-restricted \w and POSIX classes
printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n",
$rowindex, $input, codepoint($input), getlocale(),
$flags, $pattern, (($input =~ $regexp) ? 'true':'false'));
$rowindex = $rowindex + 1;
}
tryWith('v');
tryWith('š');
tryWith('ž');
Ruby (ruby 2.2.6p396 (2016-11-15 revision 56800) [x64-mingw32])
# -*- coding: utf-8 -*-
# Summary: [a-z] and \w are always [a-z], unicode \p{Letter}, \p{Alpha} and POSIX
# :alpha: are supported. Locale does not have impact.
# Ruby doesn't seem to be able to interact very well with locale without 'locale'
# rubygem (https://github.com/mutoh/locale), so that is used.
require 'rubygems'
require 'locale'
def verify(inputp, code)
if (inputp.unpack('U*')[0] != code)
raise Exception, sprintf('Your editor is not configured correctly for %s (result %s)', inputp, inputp.unpack('U*')[0])
end
end
$rowindex = 0
$origlocale = Locale.current
$origcharmap = Encoding.locale_charmap
verify('v', 118)
verify('š', 353)
verify('ž', 382)
# printf('orig locale is %s.%s', $origlocale, $origcharmap)
def tryWith(input)
matchWith('[a-z]', input)
matchWith('\w', input)
matchWith('[[:alpha:]]', input)
matchWith('\p{Alpha}', input)
matchWith('\p{L}', input)
end
def matchWith(pattern, input)
locales_to_test = [$origlocale, 'C', 'et_EE', 'Estonian_Estonia']
for testlocale in locales_to_test
Locale.current = testlocale
printResult(Regexp.new(pattern), input)
printResult(Regexp.new(pattern.force_encoding('utf-8'),Regexp::FIXEDENCODING), input)
end
Locale.current = $origlocale
end
def printResult(pattern, input)
printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n",
$rowindex, input, input.unpack('U*')[0], Locale.current,
specialFlag(pattern),
pattern, !pattern.match(input).nil?)
$rowindex = $rowindex + 1
end
def specialFlag(pattern)
return pattern.encoding
end
tryWith('v')
tryWith('š')
tryWith('ž')
Javascript (node.js) (v6.10.3)
function match(pattern, input) {
try {
var re = new RegExp(pattern, "u");
return input.match(re) !== null;
} catch(e) {
return 'unsupported';
}
}
function regexptest() {
var chars = [
String.fromCodePoint(118),
String.fromCodePoint(353),
String.fromCodePoint(382)
];
for (var i = 0; i < chars.length; i++) {
var char = chars[i];
console.log(
char
+'\t'
+ char.codePointAt(0)
+'\t'
+(match("[a-z]", char))
+'\t'
+(match("\\w", char))
+'\t'
+(match("[[:alpha:]]", char))
+'\t'
+(match("\\p{Alpha}", char))
+'\t'
+(match("\\p{L}", char))
);
}
}
regexptest();
Javascript (web browser)
function match(pattern, input) {
try {
var re = new RegExp(pattern, "u");
return input.match(re) !== null;
} catch(e) {
return 'unsupported';
}
}
window.onload = function() {
var chars = [
String.fromCodePoint(118),
String.fromCodePoint(353),
String.fromCodePoint(382)
];
for (var i = 0; i < chars.length; i++) {
var char = chars[i];
var table = document.getElementById('results');
table.innerHTML +=
'<tr><td>' + char
+'</td><td>'
+ char.codePointAt(0)
+'</td><td>'
+(match("[a-z]", char))
+'</td><td>'
+(match("\\w", char))
+'</td><td>'
+(match("[[:alpha:]]", char))
+'</td><td>'
+(match("\\p{Alpha}", char))
+'</td><td>'
+(match("\\p{L}", char))
+'</td></tr>';
}
}
table {
border-collapse: collapse;
}
table td, table th {
border: 1px solid black;
}
table tr:first-child th {
border-top: 0;
}
table tr:last-child td {
border-bottom: 0;
}
table tr td:first-child,
table tr th:first-child {
border-left: 0;
}
table tr td:last-child,
table tr th:last-child {
border-right: 0;
}
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
</head>
<body>
<table id="results">
<tr>
<td>char</td>
<td>codepoint</td>
<td>[a-z]</td>
<td>\w</td>
<td>[[:alpha:]]</td>
<td>\p{Alpha}</td>
<td>\p{L}</td>
</tr>
</table>
</body>
</html>
AWK (GNU awk 4.1.3)
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä
AWK (GNU awk 3.1.8)
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
z
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä
grep (GNU grep 2.10, GNU grep 3.4)
$ echo xuzvöä | LC_ALL=C grep [a-z]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [a-z]
xuzv öä
$ echo xuzvöä | LC_ALL=C grep [[:alpha:]]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [[:alpha:]]
xuzv öä
$ echo xuzvöä | LC_ALL=C grep \\w
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep \\w
xuzv öä
grep (GNU grep 2.16, GNU grep 2.25)
$ echo xuzvöä | LC_ALL=C grep [a-z]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [a-z]
xu z vöä
$ echo xuzvöä | LC_ALL=C grep [[:alpha:]]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep \\w
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep \\w
xuzvöä