regex - Should we consider using range [a-z] as a bug?

Question

In my locale (et_EE), [a-z] means:

abcdefghijklmnopqrsšz

So 6 ASCII characters (tuvwxy) and one character from the Estonian alphabet (ž) are not included. I see a lot of modules that still use regexes like

/\A[0-9A-Z_a-z]+\z/

To me this seems a wrong way to define the range of ASCII alphanumeric characters, and I think it should be replaced with:

/\A\p{PosixAlnum}+\z/

Is the first one still considered the idiomatic way? Or an accepted solution? Or a bug?

Or does the last one have some caveats?

score 8 · Accepted Answer

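Back in the old Perl 3.0 days, everything was ASCII, and Perl reflected that. \w meant the same thing as [0-9A-Z_a-z]. And, we liked it!

However, Perl is no longer bound to ASCII. I stopped using [a-z] a while ago because I got yelled at when programs I wrote didn't work with languages other than English. You can imagine my surprise, as an American, to discover that there are at least several thousand people in this world who don't speak English.

Perl has better ways of handling [0-9A-Z_a-z] anyway. You can use the [[:alnum:]] set or simply use \w, which should do the right thing. If you must have only lowercase characters, you can use [[:lower:]] instead of [a-z] (which assumes an English type of language). (Perl goes to some lengths to have [a-z] mean just the 26 characters a, b, c, ... z, even on EBCDIC platforms.)

If you need to specify ASCII only, you can add the /a qualifier. If you mean locale-specific, you should compile the regular expression within the lexical scope of a 'use locale'. (Avoid the /l modifier, as that applies only to the regular expression pattern and nothing else. For example, in 's/[[:lower:]]/\U$&/lg' the pattern is compiled using locale, but the \U is not. This probably should be considered a bug in Perl, but it is the way things currently work. The /l modifier is really only intended for internal bookkeeping and should not be typed in directly.) Actually, it is better to translate your locale data on input into the program, and translate it back on output, while using Unicode internally. If your locale is one of the new-fangled UTF-8 ones, a new feature in 5.16, 'use locale ":not_characters"', is available for that case.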

$word =~ /^[[:alnum:]]+$/   # $word contains only Posix alphanumeric characters.
$word =~ /^[[:alnum:]]+$/a  # $word contains only ASCII alphanumeric characters.
{ use locale;
  $word =~ /^[[:alnum:]]+$/;# $word contains only alphanum characters for your locale
}

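Now, is this a bug? If the program doesn't work as intended, it's a bug, plain and simple. If you really want the ASCII sequence [a-z], then the programmer should have used [[:lower:]] with the /a qualifier. If you want all possible lowercase characters, including those in other languages, you should simply use [[:lower:]].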
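For readers who want to try the ASCII-only versus Unicode-aware distinction outside Perl, here is a minimal sketch in Python 3. It assumes that Python's re.ASCII flag is a rough counterpart of Perl's /a qualifier; the helper names is_ascii_word and is_unicode_word are made up for illustration and do not come from the answer.

```python
import re

# Hypothetical helpers for illustration only.
ASCII_WORD = re.compile(r"\w+", re.ASCII)  # \w limited to [A-Za-z0-9_]
UNICODE_WORD = re.compile(r"\w+")          # \w is Unicode-aware for str patterns


def is_ascii_word(s):
    """True if s consists only of ASCII word characters."""
    return ASCII_WORD.fullmatch(s) is not None


def is_unicode_word(s):
    """True if s consists only of Unicode word characters."""
    return UNICODE_WORD.fullmatch(s) is not None


for word in ("tuvwxy", "šokolaad"):
    print(word, is_ascii_word(word), is_unicode_word(word))
```

Whether the ASCII-only or the Unicode-aware form is the right one depends on what the program is supposed to accept, which is the point made above.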

score 8 · Accepted Answer

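As this question goes beyond Perl, I was interested to find out how it goes in general. Testing this on popular programming languages with native regex support (Perl, PHP, Python, Ruby, Java and Javascript), the conclusions are:

[a-z] will always match the ASCII-7 a-z range in each of those languages, and locale settings do not affect it in any way. Characters like ž and š are never matched.
\w might or might not match the characters ž and š, depending on the programming language and the parameters given when the regular expression is created. For this expression the variety is greatest: in some languages they are never matched, regardless of options, in others they are always matched, and in some it depends.
POSIX [[:alpha:]] and Unicode \p{Alpha} and \p{L} will match characters like ž and š, provided the regular expression system of the language in question supports them and appropriate configuration is used.

Note that "appropriate configuration" did not require a locale change: changing the locale had no impact on the results in any of the tested systems.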
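The first finding can be checked with a few lines of Python 3. This is only a sketch: it compares what [a-z] matches with a plain code point comparison, and the helper name in_az_range is invented here.

```python
import re

LOWER_RANGE = re.compile(r"[a-z]")


def in_az_range(ch):
    """True if the single character ch is matched by [a-z]."""
    return LOWER_RANGE.fullmatch(ch) is not None


for ch in ("v", "š", "ž"):
    # The regex result agrees with a code point comparison; no locale is involved.
    by_codepoint = ord("a") <= ord(ch) <= ord("z")
    print(ch, ord(ch), in_az_range(ch), by_codepoint)
```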

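To be on the safe side, I also tested command-line Perl, grep and awk. Command-line Perl behaves identically to the above. However, the versions of grep and awk I have behave differently from the others: for them, the locale also matters for [a-z]. The behavior is version- and implementation-specific, and more recent versions of these tools do not exhibit it.

In that context (grep, awk or similar command-line tools) I would agree that using the a-z range without defining the locale could be considered a bug, as you cannot really know what you will end up with.

If we go into more detail per language, the status seems to be:

Java

In Java, \p{Alpha} matches only ASCII letters, like [a-z], if the unicode class flag is not specified, and any unicode alphabetic character if it is, matching ž. \w will match characters like ž if the unicode flag is present and not if it is absent, and \p{L} will match regardless of the unicode flag. There are no locale-aware regular expressions and no support for [[:alpha:]].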

PHP

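In PHP, \w, [[:alpha:]] and \p{L} will match characters like ž if the unicode switch is present, and not if it is absent. \p{Alpha} is not supported. Locale has no impact on regular expressions.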

Python

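\w will match the mentioned characters if the unicode flag is present and the locale flag is not. For unicode strings, the unicode flag is assumed by default in Python 3, but not in Python 2. Unicode \p{Alpha} and \p{L} and POSIX [[:alpha:]] are not supported in Python.

The modifier for locale-specific regular expressions apparently only works for character sets with 1 byte per character, making it unusable for unicode.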
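Since Python's re module has neither \p{L} nor [[:alpha:]], a commonly used workaround is the class [^\W\d_], that is, word characters minus decimal digits and the underscore. The sketch below assumes Python 3 and str patterns; only_letters is a made-up helper name. It is an approximation of \p{L}, not an exact equivalent, because \w also accepts a few numeric characters that \d does not cover.

```python
import re

# Word characters minus decimal digits and underscore: roughly "letters".
LETTERS = re.compile(r"[^\W\d_]+")


def only_letters(s):
    """True if s is non-empty and every character is in [^\\W\\d_]."""
    return LETTERS.fullmatch(s) is not None


print(only_letters("žürii"))   # True
print(only_letters("abc123"))  # False
print("žürii".isalpha())       # True: str.isalpha() is another option
```

The third-party regex package mentioned in the test code comments is the usual route when real Unicode property support is needed.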

Perl

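\w matches the previously mentioned characters in addition to [a-z]. Unicode \p{Letter} and \p{Alpha} and POSIX [[:alpha:]] are supported and work as expected. The unicode and locale flags for regular expressions did not change the results, and neither did changing the locale or switching between use locale; and no locale;.

The behavior does not change if we run the tests using command-line Perl.

Ruby

[a-z] and \w detect just the characters [a-z], regardless of options. Unicode \p{Letter} and \p{Alpha} and POSIX [[:alpha:]] are supported and work as expected. Locale has no impact.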

Javascript

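[a-z] and \w always detect just the characters [a-z]. ECMA2015 adds a /u unicode switch, which is supported by most major browsers, but it does not bring support for [[:alpha:]], \p{Alpha} or \p{L}, nor does it change the behavior of \w. What the unicode switch does add is treating a unicode character as one character, which has been a problem before.

The situation is the same for client-side javascript and Node.js.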

AWK

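For AWK, a longer description of the status is given in the article A.8 Regexp Ranges and Locales: A Long Sad Story. It explains that in the old world of unix tools, [a-z] was the correct way to detect lowercase letters, and this is how the tools of the time worked. However, POSIX 1992 introduced locales and changed the interpretation of character ranges so that the order of characters was defined by collation order, binding it to the locale. This was adopted by the AWK of the time (3.x series) as well, which resulted in several issues. By the time the 4.x series was developed, POSIX 2008 had declared the order undefined, and the maintainer reverted to the original behavior.

Nowadays the 4.x version of AWK is mostly used. With it, [a-z] matches a-z regardless of any locale changes, and \w and [[:alpha:]] match locale-specific characters. Unicode \p{Alpha} and \p{L} are not supported.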

grep

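Grep (as well as sed and ed) uses GNU Basic Regular Expressions, which is an old flavor. It has no support for unicode character classes.

At least gnu grep 2.16 and 2.25 seem to follow POSIX 1992 in that the locale also matters for [a-z], as well as for \w and [[:alpha:]]. This means, for example, that with the Estonian locale [a-z] matches only z in the set xuzvöä. This behavior does not affect older or newer versions of gnu grep, though I am not sure exactly which versions changed it.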
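The difference between a collation-order range and a code point range, as described for the older awk and grep versions above, can be illustrated with a small Python sketch. It does not query any system locale: the ESTONIAN_ORDER string is the standard lowercase Estonian alphabet typed in by hand, and that string and both function names are assumptions of this sketch.

```python
# Lowercase Estonian alphabet in its standard order (hand-written assumption,
# not read from a locale database).
ESTONIAN_ORDER = "abcdefghijklmnopqrsšzžtuvwõäöüxy"


def collation_range(first, last, order=ESTONIAN_ORDER):
    """Characters from first to last by position in the given order, inclusive."""
    return order[order.index(first):order.index(last) + 1]


def codepoint_range(first, last):
    """Characters from first to last by code point, inclusive."""
    return "".join(chr(c) for c in range(ord(first), ord(last) + 1))


print(collation_range("a", "z"))  # abcdefghijklmnopqrsšz
print(codepoint_range("a", "z"))  # abcdefghijklmnopqrstuvwxyz

# Which characters of the grep test string fall inside the collation range?
print([c for c in "xuzvöä" if c in collation_range("a", "z")])  # ['z']
```

Under this ordering the range from a to z stops right after š and z, which reproduces both the set quoted in the question and the single matched z in the grep 2.16/2.25 output.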

The full test code used for each language is listed below.

Java (1.8.0_131)

import java.util.regex.*;
import java.util.Locale;

public class RegExpTest {
    public static void main(String args[]) {
        verify("v", 118);
        verify("š", 353);
        verify("ž", 382);

        tryWith("v");
        tryWith("š");
        tryWith("ž");
    }
    static void tryWith(String input) {
        matchWith("[a-z]", input);
        matchWith("\\w", input);
        matchWith("\\p{Alpha}", input);
        matchWith("\\p{L}", input);
        matchWith("[[:alpha:]]", input);
    }

    static void matchWith(String pattern, String input) {
        printResult(Pattern.compile(pattern), input);
        printResult(Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS), input);
    }
    static void printResult(Pattern pattern, String input) {
        System.out.printf("%s\t%03d\t%5s\t%-10s\t%-10s\t%-5s%n",
          input, input.codePointAt(0), Locale.getDefault(),
          specialFlag(pattern.flags()),
          pattern, pattern.matcher(input).matches());
    }
    static String specialFlag(int flags) {
      if ((flags & Pattern.UNICODE_CHARACTER_CLASS) == Pattern.UNICODE_CHARACTER_CLASS) {
          return "UNICODE_FLAG";
      }
      return "";
    }
    static void verify(String str, int code) {
        if (str.codePointAt(0) != code) {
            throw new RuntimeException("your editor is not properly configured for this character: " + str);
        }
    }
}

PHP (7.1.5)

<?php
/*
PHP, even with 7, only has binary strings that can be operated with unicode-aware
functions, if needed. So functions operating them need to be told which charset to use.

When there is encoding assumed and not specified, PHP defaults to ISO-8859-1.
*/


// PHP7 and extension=php_intl.dll enabled in PHP.ini is needed for IntlChar class
function codepoint($char) {
  return IntlChar::ord($char);
}

function verify($inputp, $code) {
  if (codepoint($inputp) != $code) {
    throw new Exception(sprintf('Your editor is not configured correctly for %s (result %s, should be %s)',
      $inputp, codepoint($inputp), $code));
  }
}

$rowindex = 0;
$origlocale = getlocale();

verify('v', 118);
verify('š', 353); // https://en.wikipedia.org/wiki/%C5%A0#Computing_code
verify('ž', 382); // https://en.wikipedia.org/wiki/%C5%BD#Computing_code

function tryWith($input) {
  matchWith('[a-z]', $input);
  matchWith('\\w', $input);
  matchWith('[[:alpha:]]', $input); // POSIX, http://www.regular-expressions.info/posixbrackets.html
  matchWith('\p{L}', $input);
}
function matchWith($pattern, $input) {
  global $origlocale;
  selectLocale($origlocale);
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale('C'); # default (root) locale
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale(['et_EE', 'et_EE.UTF-8', 'Estonian_Estonia.1257']);
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale($origlocale);
}
function selectLocale($locale) {
  if (!is_array($locale)) {
    $locale = [$locale];
  }
  // On Windows, no UTF-8 locale can be set
  // https://stackoverflow.com/a/16120506/365237
  // https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
  // Available Windows locales
  // https://docs.moodle.org/dev/Table_of_locales
  $retval = setlocale(LC_ALL, $locale);
  //printf("setting locale %s, retval was %s\n", join(',', $locale), $retval);
  if ($retval === false || $retval === null) {
    throw new Exception(sprintf('Setting locale %s failed', join(',', $locale)));
  }
}
function getlocale() {
  return setlocale(LC_ALL, 0);
}
function printResult($pattern, $input) {
  global $rowindex;
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n",
        $rowindex, $input, codepoint($input), getlocale(),
        specialFlag($pattern), 
        $pattern, (preg_match($pattern, $input) === 1)?'true':'false');
  $rowindex = $rowindex + 1;
}
function specialFlag($pattern) {
  $arr = explode('/',$pattern);
  $lastelem = array_pop($arr);
  if (strpos($lastelem, 'u') !== false) {
    return 'UNICODE';
  }
  return '';
}

tryWith('v');
tryWith('š');
tryWith('ž');

Python (3.5.3)

# -*- coding: utf-8 -*-

# with python, there are two strings: unicode strings and regular ones.
# when you use unicode strings, regular expressions also take advantage of it,
# so no need to tell that separately. However, if you want to be using specific
# locale, that you need to tell.

# Note that python3 regexps defaults to unicode mode if unicode regexp string is used,
# python2 does not. Also strings are unicode strings in python3 by default.

# summary: [a-z] is always [a-z], \w will match if unicode flag is present and
# locale flag is not present, no unicode \p{Letter} or POSIX :alpha: exists.
# Letters outside ascii-7 never match \w if locale-specific
# regexp is used, as it only supports charsets with one byte per character
# (https://lists.gt.net/python/python/850772).

# Note that in addition to standard https://docs.python.org/3/library/re.html, more
# complete https://pypi.python.org/pypi/regex/ third-party regexp library exists.

import re, locale

def verify(inputp, code):
  if (ord(inputp[0]) != code):
    raise Exception('Your editor is not configured correctly for %s (result %s)' % (inputp, ord(inputp[0])))
  return

rowindex = 0
origlocale = locale.getlocale(locale.LC_ALL)  

verify(u'v', 118)
verify(u'š', 353)
verify(u'ž', 382)

def tryWith(input):
  matchWith(u'[a-z]', input)
  matchWith(u'\\w', input)

def matchWith(pattern, input):
  global origlocale
  locale.setlocale(locale.LC_ALL, origlocale)
  printResult(re.compile(pattern), input)
  printResult(re.compile(pattern, re.UNICODE), input)
  printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)

  matchWith2(pattern, input, 'C') # default (root) locale
  matchWith2(pattern, input, 'et_EE')
  matchWith2(pattern, input, 'et_EE.UTF-8')
  matchWith2(pattern, input, 'Estonian_Estonia.1257') # Windows locale
  locale.setlocale(locale.LC_ALL, origlocale)

def matchWith2(pattern, input, localeParam):
  try:
    locale.setlocale(locale.LC_ALL, localeParam) # default (root) locale
    printResult(re.compile(pattern), input)
    printResult(re.compile(pattern, re.UNICODE), input)
    printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)
  except locale.Error:
    print("Locale %s not supported on this platform" % localeParam)

def printResult(pattern, input):
  global rowindex
  try:
    print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
          (rowindex, input, ord(input[0]), locale.getlocale(), \
          specialFlag(pattern.flags), \
          pattern.pattern, pattern.match(input) != None))
  except UnicodeEncodeError:
    print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
          (rowindex, '?', ord(input[0]), locale.getlocale(), \
          specialFlag(pattern.flags), \
          pattern.pattern, pattern.match(input) != None))
  rowindex = rowindex + 1      

def specialFlag(flags):
  ret = []
  if ((flags & re.UNICODE) == re.UNICODE):
    ret.append("UNICODE_FLAG")
  if ((flags & re.LOCALE) == re.LOCALE):
    ret.append("LOCALE_FLAG")
  return ','.join(ret)

tryWith(u'v')
tryWith(u'š')
tryWith(u'ž')

Perl (v5.22.3)

# Summary: [a-z] is always [a-z], \w always seems to recognize given test chars and
# unicode \p{Letter}, \p{Alpha} and POSIX :alpha: are supported.
# Unicode and locale flags for regular expression didn't matter in this use case.

use warnings;
use strict;
use utf8;
use v5.14;
use POSIX qw(locale_h);
use Encode;
binmode STDOUT, "utf8";

sub codepoint {
  my $inputp = $_[0];
  return unpack('U*', $inputp);
}
sub verify {
  my($inputp, $code) = @_;
  if (codepoint($inputp) != $code) {
    die sprintf('Your editor is not configured correctly for %s (result %s)', $inputp, codepoint($inputp))
  }
}

sub getlocale {
  return setlocale(LC_ALL);
}
my $rowindex = 0;
my $origlocale = getlocale();

verify('v', 118);
verify('š', 353);
verify('ž', 382);

# printf('orig locale is %s', $origlocale);

sub tryWith {
  my ($input) = @_;
  matchWith('[a-z]', $input);
  matchWith('\w', $input);
  matchWith('[[:alpha:]]', $input);
  matchWith('\p{Alpha}', $input);
  matchWith('\p{L}', $input);
}

sub matchWith {
  my ($pattern, $input) = @_;
  my @locales_to_test = ($origlocale, 'C','C.UTF-8', 'et_EE.UTF-8', 'Estonian_Estonia.UTF-8');
  for my $testlocale (@locales_to_test) {
    use locale;
    # printf("Testlocale %s\n", $testlocale);
    setlocale(LC_ALL, $testlocale);
    printResult($pattern, $input, '');
    printResult($pattern, $input, 'u');
    printResult($pattern, $input, 'l');
    printResult($pattern, $input, 'a');
   };
  no locale;
  setlocale(LC_ALL, $origlocale);
  printResult($pattern, $input, '');
  printResult($pattern, $input, 'u');
  printResult($pattern, $input, 'l');
  printResult($pattern, $input, 'a');
}


sub printResult{
  no warnings 'locale';
              # for this test, as we want to be able to test non-unicode-compliant locales as well
              # remove this for real usage

  my ($pattern, $input, $flags) = @_;
  my $regexp = qr/$pattern/;
  $regexp = qr/$pattern/u if ($flags eq 'u');
  $regexp = qr/$pattern/l if ($flags eq 'l');
  # NOTE: the 'a' flag passed in by matchWith is only printed; no /a regex is compiled here
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n", 
        $rowindex, $input, codepoint($input), getlocale(),
        $flags, $pattern, (($input =~ $regexp) ? 'true':'false'));
  $rowindex = $rowindex + 1;
}

tryWith('v');
tryWith('š');
tryWith('ž');

Ruby (ruby 2.2.6p396 (2016-11-15 revision 56800) [x64-mingw32])

# -*- coding: utf-8 -*-

# Summary: [a-z] and \w are always [a-z], unicode \p{Letter}, \p{Alpha} and POSIX
# :alpha: are supported. Locale does not have impact.

# Ruby doesn't seem to be able to interact very well with locale without 'locale'
# rubygem (https://github.com/mutoh/locale), so that is used.

require 'rubygems'
require 'locale'

def verify(inputp, code)
  if (inputp.unpack('U*')[0] != code)
    raise Exception, sprintf('Your editor is not configured correctly for %s (result %s)', inputp, inputp.unpack('U*')[0])
  end
end

$rowindex = 0
$origlocale = Locale.current
$origcharmap = Encoding.locale_charmap

verify('v', 118)
verify('š', 353)
verify('ž', 382)

# printf('orig locale is %s.%s', $origlocale, $origcharmap)
def tryWith(input)
  matchWith('[a-z]', input)
  matchWith('\w', input)
  matchWith('[[:alpha:]]', input)
  matchWith('\p{Alpha}', input)
  matchWith('\p{L}', input)
end  

def matchWith(pattern, input)
  locales_to_test = [$origlocale, 'C', 'et_EE', 'Estonian_Estonia']
  for testlocale in locales_to_test
    Locale.current = testlocale
    printResult(Regexp.new(pattern), input)
    printResult(Regexp.new(pattern.force_encoding('utf-8'),Regexp::FIXEDENCODING), input)
  end
  Locale.current = $origlocale
end

def printResult(pattern, input)
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n", 
        $rowindex, input, input.unpack('U*')[0], Locale.current,
        specialFlag(pattern),
        pattern, !pattern.match(input).nil?)
  $rowindex = $rowindex + 1
end

def specialFlag(pattern)
  return pattern.encoding
end

tryWith('v')
tryWith('š')
tryWith('ž')

Javascript (node.js) (v6.10.3)

function match(pattern, input) {
    try {
        var re = new RegExp(pattern, "u");
        return input.match(re) !== null;
    } catch(e) {
        return 'unsupported';
    }
}
function regexptest() {
    var chars = [
        String.fromCodePoint(118),
        String.fromCodePoint(353),
        String.fromCodePoint(382)
    ];
    for (var i = 0; i < chars.length; i++) {
        var char = chars[i];
        console.log(
            char
            +'\t'
            + char.codePointAt(0)
            +'\t'
            +(match("[a-z]", char))
            +'\t'
            +(match("\\w", char))
            +'\t'
            +(match("[[:alpha:]]", char))
            +'\t'
            +(match("\\p{Alpha}", char))
            +'\t'
            +(match("\\p{L}", char))
            );
    }
}

regexptest();

Javascript (web browsers)

function match(pattern, input) {
    try {
        var re = new RegExp(pattern, "u");
        return input.match(re) !== null;
    } catch(e) {
        return 'unsupported';
    }
}
window.onload = function() {
    var chars = [
        String.fromCodePoint(118),
        String.fromCodePoint(353),
        String.fromCodePoint(382)
    ];
    for (var i = 0; i < chars.length; i++) {
        var char = chars[i];
        var table = document.getElementById('results');
        table.innerHTML += 
            '<tr><td>' + char
            +'</td><td>'
            + char.codePointAt(0)
            +'</td><td>'
            +(match("[a-z]", char))
            +'</td><td>'
            +(match("\\w", char))
            +'</td><td>'
            +(match("[[:alpha:]]", char))
            +'</td><td>'
            +(match("\\p{Alpha}", char))
            +'</td><td>'
            +(match("\\p{L}", char))
            +'</td></tr>';
    }
}

table {
    border-collapse: collapse;
}
table td, table th {
    border: 1px solid black;
}
table tr:first-child th {
    border-top: 0;
}
table tr:last-child td {
    border-bottom: 0;
}
table tr td:first-child,
table tr th:first-child {
    border-left: 0;
}
table tr td:last-child,
table tr th:last-child {
    border-right: 0;
}

<!DOCTYPE html> 
<html>
<head>
    <meta charset="utf-8" /> 
</head>
<body>
    <table id="results">
    <tr>
        <td>char</td>
        <td>codepoint</td>
        <td>[a-z]</td>
        <td>\w</td>
        <td>[[:alpha:]]</td>
        <td>\p{Alpha}</td>
        <td>\p{L}</td>
    </tr>
    </table>
</body>
</html>

AWK (GNU awk 4.1.3)

$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä

AWK (GNU awk 3.1.8)

$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
z
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä

grep (GNU grep 2.10, GNU grep 3.4)

$ echo xuzvöä | LC_ALL=C grep [a-z]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [a-z]
xuzv öä
$ echo xuzvöä | LC_ALL=C grep [[:alpha:]]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [[:alpha:]]
xuzv öä
$ echo xuzvöä | LC_ALL=C grep \\w
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep \\w
xuzv öä

grep (GNU grep 2.16, GNU grep 2.25)

$ echo xuzvöä | LC_ALL=C grep [a-z]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [a-z]
xu z vöä
$ echo xuzvöä | LC_ALL=C grep [[:alpha:]]
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep \\w
xuzv öä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep \\w
xuzvöä

score 5 · Accepted Answer

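Possible Locale Bug

The problem you're facing isn't with POSIX character classes per se, but with the fact that the classes depend on the locale. For example, regex(7) says:

Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class...These stand for the character classes defined in wctype(3). A locale may provide others.

The emphasis is mine, but the manual page clearly says that the character classes depend on the locale. Further, wctype(3) says:

The behavior of wctype() depends on the LC_CTYPE category of the current locale.

In other words, if your locale incorrectly defines a character class, then it's a bug that should be filed against the specific locale. On the other hand, if the character class simply defines the character set in a way that you are not expecting, then it may not be a bug; it may just be an issue that needs to be coded around.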

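Character Classes as Shortcuts

Character classes are shortcuts for defining sets. You certainly aren't restricted to the pre-defined sets for your locale, and you are free to use the Unicode character sets defined by perlre(1), or simply create the sets explicitly if that provides greater accuracy.

You already know this, so I'm not trying to be pedantic. I'm just pointing out that if you can't or won't fix the locale (which is the source of the problem here), then you should use an explicit set, as you have done.

A convenience class is only convenient if it works for your use case. If it doesn't, toss it overboard!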

score 1 · Accepted Answer

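Using [a-z] is not wrong if it is exactly what you want.

But it is wrong to assume that English words consist only of [a-zA-Z], or German ones of [a-zäöüßA-ZÄÖÜ], or that names follow [A-Z][a-z]*.

If we want words in any language or writing system (tested against 2,300 languages, with the 50 K most frequent words of each), we can use something like this: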

#!perl

use strict;
use warnings;
use utf8;

use 5.020;    # regex_sets need 5.18

no warnings "experimental::regex_sets";

use Unicode::Normalize;

my $word_frequencies = {};

while (my $line = <>) {
    chomp $line;
    $line = NFC($line);

    # NOTE: will catch "broken" words at end/begin of line
    #       and abbreviations without '.'
    my @words = $line =~ m/(
        (?[ \p{Word} - \p{Digit} + ['`´’] ])
        (?[ \p{Word} - \p{Digit} + ['`´’=⸗‒—-] ])*
    )/xg;
    
    for my $word (@words) {
        $word_frequencies->{$word}++;
    }
}

# now count the frequencies of graphemes the text uses

my $grapheme_frequencies = {};
for my $word (keys %{$word_frequencies}) {
    # split the current word into extended grapheme clusters
    my @graphemes = $word =~ m/(\X)/g;
    for my $grapheme (@graphemes) {
        $grapheme_frequencies->{$grapheme}
            += $word_frequencies->{$word};
    }
}

For narrower checks, we can look at the definition of \p{Word} in the Unicode standard: https://unicode.org/reports/tr18/#word

word
    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}
    \p{Join_Control}

Based on this definition of \p{Word}, we can now define a regex for, e.g., words in Latin script:

# word:
    \p{Latin}    # \p{alpha}
    \p{gc=Mark}
    # \p{digit}  # we don't want numerals in words
    \p{gc=Connector_Punctuation}
    \p{Join_Control}
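The component list above can be approximated in Python with the standard unicodedata module. This is a rough sketch, not the Perl regex from the answer: Python's re has no \p{Latin}, so the script test is replaced by checking that a letter's Unicode name starts with "LATIN", which is an assumption of this sketch and not exactly the same set as the Script=Latin property. The function names are invented.

```python
import unicodedata

JOIN_CONTROLS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ, i.e. Join_Control


def is_latin_word_char(ch):
    """Approximate test for one character of a Latin-script word."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        # Stand-in for \p{Latin}: letters whose name starts with "LATIN".
        return unicodedata.name(ch, "").startswith("LATIN")
    # gc=Mark, gc=Connector_Punctuation, Join_Control; digits are excluded.
    return cat.startswith("M") or cat == "Pc" or ch in JOIN_CONTROLS


def is_latin_word(s):
    """True if s is non-empty and every character passes the test above."""
    return bool(s) and all(is_latin_word_char(ch) for ch in s)


for candidate in ("žürii", "z\u030c", "abc123", "слово"):
    print(repr(candidate), is_latin_word(candidate))
```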

score 0 · Accepted Answer

For awk, forcing the range with octal codes over the alphabet might circumvent the inconsistencies across awk, POSIX and locales.

Something like

   /[\060-\071       # 0-9
     \101-\132       # A-Z
     \141-\172]/     # a-z
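The same idea of spelling out the ranges by code point can be sketched in Python 3, whose re module accepts three-digit octal escapes inside a character class. The pattern and the helper name is_ascii_alnum are illustrative only; the r prefix is used so that the escapes reach the regex engine as written.

```python
import re

# Octal 060-071 = 0-9, 101-132 = A-Z, 141-172 = a-z.
# The raw string keeps Python's own parser from converting the escapes first.
ASCII_ALNUM = re.compile(r"[\060-\071\101-\132\141-\172]+")


def is_ascii_alnum(s):
    """True if s is non-empty and contains only ASCII letters and digits."""
    return ASCII_ALNUM.fullmatch(s) is not None


print(is_ascii_alnum("Az09"))  # True
print(is_ascii_alnum("šz"))    # False
```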

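If you want to make them string constants, perhaps double up the backslashes to make sure the parser or regex engine doesn't get too smart and pre-convert "\101" to A, which would give it a chance to "respect" locale settings that may not be what you want.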

"\\101"
