I'm trying to split several key-value lines with a regular expression in Python. The file I'm working with has more than 1.2M lines, so I created a smaller one with a few lines that cover all the different key-value forms I need to care about:
@=""
@="0"
@="="
@="@"
@="k=\"v\""
@=dword:00000000
@=hex:00
"k"=""
"k"="="
"k"="@"
"k"="k=\"v\""
"k"="v"
"k"=dword:00000000
"k"=hex:00
"k=\"v\""=""
"k=\"v\""="="
"k=\"v\""="@"
"k=\"v\""="k=\"v\""
"k=\"v\""="v"
"k=\"v\""=dword:00000000
"k=\"v\""=hex:00
I'm already doing the job with a fairly simple look-behind/look-ahead regex that works like a charm:
#!/usr/bin/env python
import re
regex = re.compile(r'(?<=@|")=(?=[dh"])')
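for line in open('split-test'):
    line = line.strip()
    # Split only on the first '=' that separates key from value.
    key, value = regex.split(line, 1)
    if key != '@':
        # Strip the surrounding quotes from a quoted key.
        key = key[1:-1]
    print('{} => {}'.format(key, value))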
Output:
@ => ""
@ => "0"
@ => "="
@ => "@"
@ => "k=\"v\""
@ => dword:00000000
@ => hex:00
k => ""
k => "="
k => "@"
k => "k=\"v\""
k => "v"
k => dword:00000000
k => hex:00
k=\"v\" => ""
k=\"v\" => "="
k=\"v\" => "@"
k=\"v\" => "k=\"v\""
k=\"v\" => "v"
k=\"v\" => dword:00000000
k=\"v\" => hex:00
As you can see, the code still has to strip the leading and trailing quotes from the key after the split. To be clear, I'm not trying to optimize anything; I just want to learn how to get the same result from the regular expression itself.
I've tried many variations of the original code above, and I ended up with a new horrible-and-slow-but-working regex:
#!/usr/bin/env python
import re
regex = re.compile(r'(?:(@)|(?:"((?:(?:[^"\\]+)|\\.)*)"))=')
for line in open('split-test'):
    line = line.strip()
    # filter(None, ...) drops the empty and None pieces returned by split().
    key, value = filter(None, regex.split(line))
    print('{} => {}'.format(key, value))
Here I have to use filter() because split() also returns some empty strings (and None for whichever capture group did not take part in the match). I'm not a regular expression master, so I'm wondering whether there is a better-written regex that would do this job.
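To make the filter() point concrete, here is a small sketch of what split() returns for the second pattern before any filtering. The pattern is copied unchanged from the code above; the helper names raw_split and split_key_value are made up for this illustration only.

```python
import re

# The second pattern from the question, unchanged.
regex = re.compile(r'(?:(@)|(?:"((?:(?:[^"\\]+)|\\.)*)"))=')


def raw_split(line):
    """Return the unfiltered pieces produced by regex.split() (hypothetical helper)."""
    return regex.split(line)


def split_key_value(line):
    """Drop empty and None pieces, like filter(None, ...) does (hypothetical helper)."""
    return [piece for piece in raw_split(line) if piece]


# The leading '' is the text before the match; None is the capture
# group that did not participate in the match.
print(raw_split('@="0"'))      # ['', '@', None, '"0"']
print(raw_split('"k"="v"'))    # ['', None, 'k', '"v"']
print(split_key_value('"k"="v"'))  # ['k', '"v"']
```

One caveat, assuming such input can occur at all (the sample file has none): an empty quoted key such as ""="v" is captured as an empty string, so filtering would drop the key as well and the two-name unpacking would fail.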