python - Unicode, regular expressions and PyPy

Question

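I wrote a program to add (limited) Unicode support to Python regular expressions. It works fine on CPython 2.5.2 but fails on PyPy (~~1.5.0-alpha0~~ 1.8.0, implementing Python ~~2.7.1~~ 2.7.2), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my use of the u" and ur" string prefixes. The full source is here, and the relevant bits are: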

# -*- coding:utf-8 -*-
import re

# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
    ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
    ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
    ...
    ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
    ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
    ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}

def hack_regexp(regexp_string):
    for (k,v) in unicode_categories.items():
        regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
    return regexp_string

def regex(regexp_string,flags=0):
    """Shortcut for re.compile that also translates and adds the UNICODE flag

    Example usage:
        >>> from unicode_hack import regex
        >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
        >>> print result.group(0)
        áÇñ
        >>> 
    """
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(On PyPy there is no match in the "Example usage", so result is None.)
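For readers who want to see the substitution step in isolation, here is a minimal sketch. The two-entry `sample_categories` table and the name `expand_categories` are illustrative assumptions, not part of the original module; the character classes are a subset of the Pi and Pf entries shown above. It is written with escaped, non-raw unicode literals so the file stays ASCII-only and the same code runs on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical, shortened table: the real module covers many more categories.
sample_categories = {
    u'Pi': u'[\u00ab\u2018\u201c]',  # a few initial quotation marks
    u'Pf': u'[\u00bb\u2019\u201d]',  # a few final quotation marks
}

def expand_categories(pattern, categories=sample_categories):
    """Replace each \\p{Name} token with its character class."""
    for name, char_class in categories.items():
        pattern = pattern.replace(u'\\p{%s}' % name, char_class)
    return pattern

# \p{Pi} and \p{Pf} are expanded before the pattern reaches re.compile.
expanded = expand_categories(u'^\\p{Pi}(\\w+)\\p{Pf}$')
match = re.compile(expanded, re.UNICODE).match(u'\u00abhola\u00bb')
print(match.group(1))  # prints: hola
```

Because every non-ASCII character here is written as an escape, this version does not depend on how the interpreter decodes the source file.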

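To reiterate, the program works fine on CPython: the Unicode data seems correct, the replacement works as intended, and the usage example runs correctly (both via doctest and when typed directly at the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.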

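Any ideas about what PyPy does "differently" that breaks my code? Several things came to mind (an unrecognized coding header, different encodings at the command line, different interpretations of the r and u prefixes), but as far as my tests go, CPython and PyPy seem to behave identically, so I'm clueless about what to try next.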
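One way to narrow down the "different encodings at the command line" hypothesis is to print what each interpreter reports about its own encodings and compare the two outputs. This is only a diagnostic sketch: the function name `encoding_report` is made up for illustration, and identical reports would not by themselves rule out a decoding bug.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings the running interpreter reports about itself."""
    return {
        # Console encodings; may be None when the streams are redirected.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding used for implicit str/unicode conversions.
        'default': sys.getdefaultencoding(),
        'filesystem': sys.getfilesystemencoding(),
        # Encoding suggested by the user's locale settings.
        'locale': locale.getpreferredencoding(False),
    }

for name, value in sorted(encoding_report().items()):
    print('%-10s %s' % (name, value))
```

Running the same script under CPython and PyPy on the same machine and diffing the output shows whether the two disagree about the console or locale encoding.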

score 7 · Accepted Answer

Why aren't you simply using Matthew Barnett's highly recommended regexp module (published as `regex`) instead?

It works on both Python 3 and legacy Python 2, and is a drop-in replacement for re.
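A minimal sketch of what that could look like, assuming the third-party `regex` package is installed (for example via `pip install regex`). The helper name `leading_word` is hypothetical; the pattern is the one from the question's docstring, passed straight through with no category table.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

def leading_word(text):
    """Match a lowercase letter followed by any letters, using \\p{...} directly."""
    if regex is None:
        raise RuntimeError('the third-party regex module is not installed')
    # regex.UNICODE is the default for text patterns on Python 3 and is
    # passed explicitly here so the same call also suits Python 2.
    found = regex.match(u'^\\p{Ll}\\p{L}*', text, flags=regex.UNICODE)
    return found.group(0) if found else None

if regex is not None:
    # The input is written with escapes so the example does not depend on
    # how the source file or console is decoded.
    print(leading_word(u'\xe1\xc7\xf1123') == u'\xe1\xc7\xf1')
```

Whether the module can be installed on a given PyPy release is a separate question that this sketch does not address.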

score 6 · Accepted Answer

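It seems PyPy has some encoding problems, both when reading a source file (possibly an unrecognized coding header) and when handling input/output at the command line. I replaced my example code with the following: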

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0) == u'áÇñ'
True
>>>

It kept working on CPython and failing on PyPy. On the other hand, replacing "áÇñ" with its escaped form, u'\xe1\xc7\xf1', did the trick:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'\xe1\xc7\xf1123')
>>> print result.group(0) == u'\xe1\xc7\xf1'
True
>>>
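If many literals need this treatment, the escaped form can be generated rather than typed by hand. The sketch below uses the standard `unicode_escape` codec; the helper name `to_ascii_literal` is an assumption introduced for illustration.

```python
def to_ascii_literal(text):
    """Return text as a u'...' literal that contains only ASCII characters."""
    # unicode_escape turns non-ASCII characters into \xNN, \uNNNN or
    # \UNNNNNNNN escapes and doubles any existing backslashes.
    escaped = text.encode('unicode_escape').decode('ascii')
    # Single quotes are not escaped by the codec, so handle them here.
    return u"u'%s'" % escaped.replace(u"'", u"\\'")

print(to_ascii_literal(u'\xe1\xc7\xf1123'))  # prints: u'\xe1\xc7\xf1123'
```

A literal produced this way reads the same regardless of the source-file or console encoding, which is why the escaped version behaves identically on both interpreters.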

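That worked fine on both. I believe the problem is restricted to those two scenarios (source loading and the command line), since opening a UTF-8 file with codecs.open works fine. When I type the string "áÇñ" at the command line, or when I load the source of "unicode_hack.py" using codecs, I get the same result on CPython: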

>>> u'áÇñ'
u'\xe1\xc7\xf1'
>>> import codecs
>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

But the results differ on PyPy:

>>>> u'áÇñ'
u'\xa0\u20ac\xa4'
>>>> import codecs
>>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'
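The PyPy console result above is consistent with one particular mismatch: bytes produced in an OEM console code page being decoded with the ANSI code page. The sketch below reproduces the exact value under the assumption that the console used cp850 and the bytes were decoded as cp1252. Both code pages are guesses about a Western European Windows XP setup, not something stated in the thread, and this does not explain the source-file case.

```python
# "áÇñ" written with escapes so this file stays ASCII-only.
typed = u'\xe1\xc7\xf1'

# Assumption: the console hands over the text encoded in the OEM code page.
console_bytes = typed.encode('cp850')

# Assumption: the interpreter then decodes those bytes with the ANSI code page.
misread = console_bytes.decode('cp1252')

# The misread text equals the value PyPy displayed for u'áÇñ'.
print(misread == u'\xa0\u20ac\xa4')  # prints: True
```

If that is what happens, decoding console input with the console's own code page would restore the expected u'\xe1\xc7\xf1'.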

Update: I submitted Issue1139 to the PyPy bug tracker; let's see how that turns out...
