python - How can I make regex parsing of a data file more efficient in Python?

Question

I have data files like this:

group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu

So in Python I built a regular expression like this:

Data = re.findall(r'(([\w\(\)]+[ \t\f]?)+):([ \t\f]*(\S+))', data)

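My files are close to 600 lines long, usually with 2 columns as shown above, and each file takes several minutes to parse.

What is the best way to make this code more efficient, so that each file runs in under 10 seconds?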

score 2 · Accepted Answer

import re

data = """group Head:
  data1: abc         data2: def
  2word data3: ghi   data4: jkl
  data3: mno         three word data4: pqr stu"""

for l in data.split('\n'):
    # Split columns on runs of 2+ whitespace characters, then split each field at the colon.
    print([x.split(':') for x in re.split(r'\s\s+', l) if x])

This gives:

[['group Head', '']]
[['data1', ' abc'], ['data2', ' def']]
[['2word data3', ' ghi'], ['data4', ' jkl']]
[['data3', ' mno'], ['three word data4', ' pqr stu']]
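
The splitting approach above relies on a key and its value being separated by a single space, while the file in the question pads values with several spaces. A minimal sketch of one way to handle that, assuming columns are always separated by two or more whitespace characters and that a field with nothing after its colon is a group header (the function name parse_groups is made up for illustration):

```python
import re

def parse_groups(text):
    """Return {group name: {key: value}} using whitespace splitting."""
    groups = {}
    current = None
    for line in text.splitlines():
        # Collapse the padding after each colon so a key stays attached to its value.
        line = re.sub(r':[ \t]+', ': ', line.strip())
        if not line:
            continue
        # Assumption: columns are separated by two or more whitespace characters.
        for field in re.split(r'\s{2,}', line):
            key, _, value = field.partition(':')
            if value.strip():
                groups.setdefault(current, {})[key.strip()] = value.strip()
            else:
                # Assumption: nothing after the colon means a group header.
                current = key.strip()
                groups[current] = {}
    return groups

data = """group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu"""

print(parse_groups(data))
```

Both regular expressions used here are simple single-level repetitions, so there is no nested quantifier for the engine to backtrack through.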

score 2 · Accepted Answer

You are nesting repetition operators, and you may be getting exponential backtracking.

Try this:

r'(\S.+)\s*:\s*(\S+)'

A non-whitespace character followed by anything else, optional whitespace around the colon, then some non-whitespace characters.
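
To see why the nested repetition in the question is costly, here is a small sketch that times the original pattern on input it cannot match (a run of word characters with no colon). The helper name time_failed_search and the test strings are made up for illustration, and the actual timings depend on the machine.

```python
import re
import time

# The pattern from the question: a repeated group that itself contains a
# repetition, with only an optional separator between iterations.
NESTED = re.compile(r'(([\w\(\)]+[ \t\f]?)+):([ \t\f]*(\S+))')

def time_failed_search(n):
    """Search n word characters that contain no colon; return (result, seconds)."""
    text = 'a' * n
    start = time.perf_counter()
    result = NESTED.search(text)
    elapsed = time.perf_counter() - start
    return result, elapsed

for n in (12, 14, 16, 18):
    result, elapsed = time_failed_search(n)
    print(n, result, f'{elapsed:.4f}s')
```

Each extra character roughly doubles the number of ways the inner and outer repetitions can divide the run between them, so the time for a failed match should climb quickly as n grows.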

score 1 · Accepted Answer

This might take less time:

 # ([\w()](?:[^\S\r\n]?[\w()]+)*)[^\S\r\n]*:[^\S\r\n]*([\w()](?:[^\S\r\n]?[\w()]+)*)

 (                                 # (1) Key
      [\w()] 
      (?: [^\S\r\n]? [\w()]+ )*
 )
 [^\S\r\n]* : [^\S\r\n]* 
 (                                 # (2) Value
      [\w()] 
      (?: [^\S\r\n]? [\w()]+ )*
 )
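
In Python, the expanded form above can be used more or less as written by compiling it with the re.VERBOSE flag, which ignores whitespace and # comments outside character classes. A sketch, run against the sample from the question (the name KEY_VALUE is only for illustration):

```python
import re

KEY_VALUE = re.compile(r"""
    (                                 # (1) Key
        [\w()]
        (?: [^\S\r\n]? [\w()]+ )*
    )
    [^\S\r\n]* : [^\S\r\n]*
    (                                 # (2) Value
        [\w()]
        (?: [^\S\r\n]? [\w()]+ )*
    )
""", re.VERBOSE)

data = """group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu"""

# findall returns one (key, value) tuple per match.
pairs = KEY_VALUE.findall(data)
print(pairs)
```

Because [^\S\r\n] excludes line breaks, a match cannot run across lines, so the header line, which has no value after its colon, produces no pair. Note that the group still repeats [\w()]+ behind an optional separator, so very long runs of word characters that fail to match could still backtrack heavily.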

score 0 · Accepted Answer

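Precompile your regular expression (see the documentation).

If possible, split the file and parse it line by line.

Both should help reduce your run time.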
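
A minimal sketch of both suggestions together: the pattern is compiled once and then applied to one line at a time. The pattern here is hypothetical, not taken from any of the answers; it assumes keys never contain a colon and values are words separated by single spaces. io.StringIO stands in for an open file.

```python
import io
import re

# Compiled once and reused for every line.
# Hypothetical pattern: the key may not contain a colon, and the value is a
# run of words separated by single spaces.
KEY_VALUE = re.compile(r'([^:\s][^:]*?)[ \t]*:[ \t]*(\S+(?: \S+)*)')

def parse_lines(lines):
    """Yield (key, value) pairs from an iterable of lines, one line at a time."""
    for line in lines:
        for match in KEY_VALUE.finditer(line):
            yield match.group(1), match.group(2)

# A real file object can be iterated line by line in the same way.
sample = io.StringIO(
    "group Head:\n"
    "  data1:        abc         data2:            def\n"
    "  2word data3:  ghi         data4:            jkl\n"
    "  data3:        mno         three word data4: pqr stu\n"
)
pairs = list(parse_lines(sample))
print(pairs)
```

The re module already caches recently used patterns, so precompiling on its own may only save a little; working line by line with a pattern that has no ambiguous nested repetition is likely where most of the gain comes from.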
