python - A "broken" regular expression?

Question

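I have a regular expression for parsing many values, e.g. a=b c=d e=f should result in a dictionary like this: {'a': 'b', 'c':'d', 'e':'f'}. I wanted to allow the user to escape values using \, so instead of a really simple regex I used ((?:[^\\\s=]+|\\.)+), and I added (?:^|\s) and (?=\s|$) so that the expression wouldn't match partial results.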

>>> import re
>>> reg = re.compile(r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]+|\\.)+)(?=\s|$)')
>>> s = r'a=b c=d e=one\two\three'
>>> reg.findall(s)
[('a', 'b'), ('c', 'd'), ('e', 'one\\two\\three')]

But then someone came along and put an = into the right-hand side of a value.

>>> s = r'a=b c=d e=aaaaaaaaaaaaaaaaaaaaaaaaaa\bbbbbbbbbbbbbbbbbbbbbbbbbbbb\ccccccccc=dddddddddddddddd\eeeeeeeeeeeeeee'
>>> reg.findall(s)

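And the script got stuck on this line (I waited for several hours and it never finished).

Question: is this a bad regular expression (why? how would you write it?), or is it a bug in the regex implementation?

Note: I'm not asking for a solution to this issue; I'm curious why findall() doesn't finish within a few hours.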

score 1 · Accepted Answer

Your problem is that you nest repetitions, and the regex engine seems to try every possible way of distributing the characters among them:

r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]+|\\.)+)(?=\s|$)'
                                ^     ^

Better:

r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]|\\.)+)(?=\s|$)'

In fact, the findall would eventually finish (or run out of memory); it just takes longer and longer as the value grows. You can try this with

s = r'a=b c=d e=aaaaaaa\bbbbbbbb\ccccccccc=ddddddddd\eeeee'

and then successively adding characters after "e=".
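To make the growth visible without waiting for hours, here is a rough sketch that times both patterns on deliberately short inputs. The names `nested`, `flat`, `count_splits` and `time_findall` are made up for this illustration, and the test strings are my own. The assumption is that a run of n ordinary characters can be cut into non-empty chunks in 2**(n-1) ways, and that a backtracking engine like CPython's `re` walks through those splits before giving up, so the time for the nested pattern should roughly double with each added character.

```python
import re
import time

# Pattern from the question: a "+" inside a group that is itself repeated.
nested = re.compile(r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]+|\\.)+)(?=\s|$)')
# Pattern from this answer: the inner "+" is removed.
flat = re.compile(r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]|\\.)+)(?=\s|$)')


def count_splits(n):
    """Ways to cut a run of n >= 1 characters into non-empty chunks."""
    return 2 ** (n - 1)


def time_findall(pattern, text):
    """Return (matches, elapsed seconds) for pattern.findall(text)."""
    start = time.perf_counter()
    matches = pattern.findall(text)
    return matches, time.perf_counter() - start


# Keep n small: every extra character doubles the number of splits.
for n in (12, 14, 16, 18):
    text = 'a=b e=' + 'x' * n + '=y'   # the "=" after the run forces a failure
    slow_matches, slow = time_findall(nested, text)
    fast_matches, fast = time_findall(flat, text)
    assert slow_matches == fast_matches == [('a', 'b')]
    print(n, count_splits(n), f'nested={slow:.4f}s', f'flat={fast:.6f}s')
```

Both patterns return the same matches; only the amount of work before the e=... pair is rejected differs. With 26 and more characters per run, as in the question, the number of splits is already in the tens of millions for a single run.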

score 0 · Accepted Answer

Answered 2013-06-12T08:44:07.847

score 0 · Accepted Answer

>>> import re
>>> reg = re.compile(r'(\w+)=(\S+)')
>>> dict(reg.findall(r'a=b c=d e=one\two\three'))
{'e': 'one\\two\\three', 'a': 'b', 'c': 'd'}
>>> dict(reg.findall(r'a=b c=d e=aaaaaaaaaaaaaaaaaaaaaaaaaa\bbbbbbbbbbbbbbbbbbbbbbbbbbbb\ccccccccc=dddddddddddddddd\eeeeeeeeeeeeeee'))
{'e': 'aaaaaaaaaaaaaaaaaaaaaaaaaa\\bbbbbbbbbbbbbbbbbbbbbbbbbbbb\\ccccccccc=dddddddddddddddd\\eeeeeeeeeeeeeee', 'a': 'b', 'c': 'd'}
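One thing to keep in mind with (\S+): it accepts = inside a value, but it stops at the first whitespace character, so a backslash-escaped space (which the question wanted to allow) ends the value early. A possible middle ground, as a sketch only: the pattern `PAIR` and the helper `parse_pairs` below are hypothetical and not taken from any of the answers. It keeps the escape handling from the question, lets = through, and has no quantifier inside the repeated group.

```python
import re

# Hypothetical variant: either one character that is neither a backslash
# nor whitespace, or a backslash followed by any one character. The two
# alternatives cannot match at the same position, and nothing inside the
# repeated group is itself repeated.
PAIR = re.compile(r'(?:^|\s)(\w+)=((?:[^\\\s]|\\.)+)(?=\s|$)')

# Pattern from the answer above, for comparison.
simple = re.compile(r'(\w+)=(\S+)')


def parse_pairs(text):
    """Return a dict of key=value pairs; values keep their backslashes."""
    return dict(PAIR.findall(text))


line = r'k=hello\ world x=1'
print(simple.findall(line))   # [('k', 'hello\\'), ('x', '1')]
print(parse_pairs(line))      # {'k': 'hello\\ world', 'x': '1'}
```

The values are returned with their backslashes still in place; stripping the escapes would be a separate step.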
