I'm trying to split several key-value lines with a regular expression in Python. The file I'm working with has more than 1.2M lines, so I created a smaller one with a few lines that cover all the different key-value forms I need to handle:

@=""
@="0"
@="="
@="@"
@="k=\"v\""
@=dword:00000000
@=hex:00
"k"=""
"k"="="
"k"="@"
"k"="k=\"v\""
"k"="v"
"k"=dword:00000000
"k"=hex:00
"k=\"v\""=""
"k=\"v\""="="
"k=\"v\""="@"
"k=\"v\""="k=\"v\""
"k=\"v\""="v"
"k=\"v\""=dword:00000000
"k=\"v\""=hex:00

I'm already doing the job with a fairly simple look-behind/look-ahead regex that works like a charm:

#!/usr/bin/env python
import re
regex = re.compile(r'(?<=@|")=(?=[dh"])')

for line in open('split-test'):
    line = line.strip()
    key, value = regex.split(line, 1)

    if key != '@':
        key = key[1:-1]

    print('{} => {}'.format(key, value))

Output:

@ => ""
@ => "0"
@ => "="
@ => "@"
@ => "k=\"v\""
@ => dword:00000000
@ => hex:00
k => ""
k => "="
k => "@"
k => "k=\"v\""
k => "v"
k => dword:00000000
k => hex:00
k=\"v\" => ""
k=\"v\" => "="
k=\"v\" => "@"
k=\"v\" => "k=\"v\""
k=\"v\" => "v"
k=\"v\" => dword:00000000
k=\"v\" => hex:00

As you can see, I still have to strip the leading and trailing quotes from the key in the code afterwards. To be clear, I'm not trying to optimize anything; I'm just trying to learn how to achieve the same result with the regular expression itself.

I've tried many changes to the original code above, and I ended up with a new horrible-and-slow-but-working regexp:

#!/usr/bin/env python
import re
regex = re.compile(r'(?:(@)|(?:"((?:(?:[^"\\]+)|\\.)*)"))=')

for line in open('split-test'):
    line = line.strip()
    key, value = filter(None, regex.split(line))

    print('{} => {}'.format(key, value))

Here I have to use filter() because split() also returns an empty leading string, plus None for whichever capture group did not take part in the match. I'm not a regular expression master, so I'm wondering whether there is a better-written regex that would do this job.
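
For illustration, this is roughly what split() hands back before filtering. It is only a small sketch written for Python 3, and the two sample lines are taken from the test file above:

```python
import re

# Same pattern as in the question: group 1 captures "@",
# group 2 captures the body of a quoted key.
regex = re.compile(r'(?:(@)|(?:"((?:(?:[^"\\]+)|\\.)*)"))=')

for line in ['@="0"', '"k"="v"']:
    parts = regex.split(line)
    # parts holds: text before the match, group 1, group 2, text after the match
    print(parts)
    # filter(None, ...) keeps only the truthy items, i.e. the key and the value
    print(list(filter(None, parts)))
```

Because the pattern contains two capture groups and only one of them can match on a given line, the other one shows up as None, and the text before the match is always the empty string.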

3 Answers

Would this do the trick?

#!/usr/bin/env python
import re
string = r"""@=""
@="0"
@="="
@="@"
@="k=\"v\""
@=dword:00000000
@=hex:00
"k"=""
"k"="="
"k"="@"
"k"="k=\"v\""
"k"="v"
"k"=dword:00000000
"k"=hex:00
"k=\"v\""=""
"k=\"v\""="="
"k=\"v\""="@"
"k=\"v\""="k=\"v\""
"k=\"v\""="v"
"k=\"v\""=dword:00000000
"k=\"v\""=hex:00
"""
regex = re.compile(r'("?)(.*)\1=(["hd].+)')

results = regex.findall(string)
for _, key, value in results:
    print('{} => {}'.format(key, value))

It gives the following output:

@ => ""
@ => "0"
@ => "="
@ => "@"
@ => "k=\"v\""
@ => dword:00000000
@ => hex:00
k => ""
k => "="
k => "@"
k => "k=\"v\""
k => "v"
k => dword:00000000
k => hex:00
k=\"v\" => ""
k=\"v\" => "="
k=\"v\" => "@"
k=\"v\" => "k=\"v\""
k=\"v\" => "v"
k=\"v\" => dword:00000000
k=\"v\" => hex:00
answered 2013-08-29T16:29:39.837

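I think you were on the right track with your last regex, the one that tries to parse the quotes. The patterns below use capture groups instead of splitting.

There are two ways to go.

Assuming the quotes may be imperfect (unbalanced):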

 #  ^((?:"[^"\\]*(?:\\.[^"\\]*)*"|.)*)=((?:"[^"\\]*(?:\\.[^"\\]*)*"|[^=])*)$

 ^
 (                         # (1 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  .
      )*
 )                         # (1 end)
 =
 (                         # (2 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  [^=]
      )*
 )                         # (2 end)
 $

Or, assuming they are perfect (balanced):

 #  ^((?:"[^"\\]*(?:\\.[^"\\]*)*"|[^"])*)=((?:"[^"\\]*(?:\\.[^"\\]*)*"|[^="])*)$

 ^
 (                         # (1 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  [^"]
      )*
 )                         # (1 end)
 =
 (                         # (2 start)
      (?:
           "
           [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
        |  [^="]
      )*
 )                         # (2 end)
 $
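
As a rough sketch of how the second (balanced quotes) pattern could be applied, the expanded layout can be compiled with re.VERBOSE. The names PAIR and split_pair are made up for this illustration, and the code assumes Python 3:

```python
import re

# The second ("balanced quotes") pattern from above, compiled with re.VERBOSE
# so that the whitespace and comments in the expanded layout are ignored.
PAIR = re.compile(r'''
    ^
    (                                   # group 1: key, quotes included
        (?: " [^"\\]* (?: \\ . [^"\\]* )* " | [^"] )*
    )
    =
    (                                   # group 2: value
        (?: " [^"\\]* (?: \\ . [^"\\]* )* " | [^="] )*
    )
    $
''', re.VERBOSE)

def split_pair(line):
    """Return (key, value) for one line, or None if the line does not match."""
    m = PAIR.match(line)
    return m.groups() if m else None

for sample in ['@="="', '"k"=dword:00000000', r'"k=\"v\""="k=\"v\""']:
    print(split_pair(sample))
```

Note that group 1 still contains the surrounding quotes of a quoted key, so this approach finds the right = to split on but does not strip the quotes by itself.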
answered 2013-08-29T17:22:47.490

So you want a regex that strips off the quotes as part of the match? Have a look at this:

r'^(")?((?(1)[^"\\]*(?:\\.[^"\\]*)*|@))"?=([dh"].+$)'

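If the first character is a quote, it is captured in group 1, the (1) condition succeeds, and the YES branch of the conditional consumes everything up to the next unescaped quote (but not the quote itself). If not, the NO branch tries to match @. Either way, the key is captured in group 2, without the quotes.

The rest of the regex is simple: it consumes the trailing quote (if there is one) and the =, then the rest of the string is captured in group 3. Note that it can match malformed input that starts with @" or "". If that is not acceptable, you can add a lookahead that validates the format before the actual match begins. I didn't bother, because the extra clutter would get in the way of explaining the core technique.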

^
(")?
(
  (?(1)
    [^"\\]*(?:\\.[^"\\]*)*
    |
    @
  )
)
"?
=
([dh"].+$)
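
A minimal sketch of how this might be used from Python 3 follows. The names KEY_VALUE, STRICT, parse and parse_strict are hypothetical, and the lookahead in STRICT is only one possible way to do the validation mentioned above; it assumes that a valid line starts with either @= or a non-empty, fully quoted key followed by =.

```python
import re

# The conditional pattern from above: group 2 is the key without quotes,
# group 3 is the value.
KEY_VALUE = re.compile(r'^(")?((?(1)[^"\\]*(?:\\.[^"\\]*)*|@))"?=([dh"].+$)')

# Assumed stricter variant: a lookahead checks the shape of the key first.
# It adds no capture group, so the group numbers stay the same.
STRICT = re.compile(
    r'^(?=@=|"(?:[^"\\]|\\.)+"=)'
    r'(")?((?(1)[^"\\]*(?:\\.[^"\\]*)*|@))"?=([dh"].+$)'
)

def parse(line, pattern=KEY_VALUE):
    """Return (key, value), or None if the line does not match."""
    m = pattern.match(line)
    return (m.group(2), m.group(3)) if m else None

def parse_strict(line):
    """Same as parse(), but rejects keys that fail the lookahead check."""
    return parse(line, STRICT)

for sample in ['@=""', r'"k=\"v\""="="', '@"=hex:00']:
    print(sample, parse(sample), parse_strict(sample))
```

The last sample is deliberately malformed: the plain pattern accepts it with @ as the key, while the variant with the lookahead returns None.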
answered 2013-08-29T21:13:48.483