python - Complicated parsing in python

Question

I have a weird parsing problem with Python. I need to parse the following text.

I need only the section between (but not including) the <pre> tag and the column of numbers (starting with 205 4 164). I have several pages in this format.

<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>

score 3 · Accepted Answer

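Quazi, this calls for a regular expression, specifically <pre>(.+?)(?:\d+\s+){3} with the DOTALL flag enabled.

You can find information on how to use regular expressions in Python at http://docs.python.org/library/re.html, and if you do a lot of this kind of string extraction you will be glad you did. Going through the regex I provided piece by piece:

<pre> just matches the pre tag literally
(.+?) matches and captures any characters, as few as possible
(?:\d+\s+){3} matches some digits followed by some whitespace, three times in a row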
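To make the pieces above concrete, here is a minimal sketch that applies that pattern to the sample page from the question. The variable names (`page`, `pattern`, `fields`) are illustrative, and the splitting of the captured text into non-blank lines is an added convenience, not part of the regex itself.

```python
import re

# Sample page, copied from the question.
page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

# re.DOTALL lets "." match newlines, so the lazy group can span several lines.
# The group stops at the first place where three "digits + whitespace" runs follow.
pattern = re.compile(r"<pre>(.+?)(?:\d+\s+){3}", re.DOTALL)

match = pattern.search(page)
header = match.group(1).strip() if match else ""

# Drop the blank separator lines to get one entry per field.
fields = [line for line in header.splitlines() if line.strip()]
print(fields)
```

Note that this relies on the header never containing three whitespace-separated numbers in a row; a header line such as "1 2 3 Testing" would end the capture early.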

score 2 · Accepted Answer

I would probably use lxml or BeautifulSoup. IMO, regular expressions are overused, especially when parsing HTML.
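As a dependency-free illustration of the parser-based approach, the sketch below uses the standard library's `html.parser` module as a stand-in for lxml or BeautifulSoup (it is a swapped-in substitute, not one of the libraries named above). The names `PreTextCollector` and `header_lines`, and the rule "stop at the first line made up only of integers", are assumptions made for this example.

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Collects the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)


def header_lines(html_text):
    """Return the non-blank <pre> lines that precede the numeric table."""
    parser = PreTextCollector()
    parser.feed(html_text)
    parser.close()
    kept = []
    for line in "".join(parser.chunks).splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: the numeric table starts at the first all-integer line.
        if all(token.isdigit() for token in line.split()):
            break
        kept.append(line)
    return kept


# Abbreviated version of the sample page from the question.
page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

print(header_lines(page))
```

With BeautifulSoup or lxml the tag handling would be shorter, but the step of separating the header from the numeric rows would still be needed.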

score 2 · Accepted Answer

Here is a regular expression that does this:

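import re

# The lookbehind starts the match right after <pre>; the lookahead ends it
# where only digits and whitespace remain before </pre>.
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)

# ...

# data is the page's HTML as a string
result = findData.search(data).group(0).strip()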

Here is a demo.
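The lookaround pattern from this answer can be exercised on the sample page from the question. This is a minimal sketch: the names `page` and `extract_header` are illustrative, and returning `None` when nothing matches is an added safeguard rather than part of the original snippet.

```python
import re

# Same pattern as in the answer, written as a raw string.
FIND_DATA = re.compile(r"(?<=<pre>).+?(?=[\d\s]*</pre>)", re.S)


def extract_header(data):
    """Return the text between <pre> and the trailing numeric block, or None."""
    match = FIND_DATA.search(data)
    return match.group(0).strip() if match else None


# Abbreviated version of the sample page from the question.
page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

print(extract_header(page))
```

Because the lookahead accepts only digits and whitespace up to `</pre>`, the match ends after the last header character that is neither, which is the "M" of "PM" in this sample.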

score 1 · Accepted Answer

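Others have provided regular expression solutions, which are fine but can sometimes behave unexpectedly.

If the pages are exactly as shown in your example, that is:

No other HTML markup is present - only the <html> and <pre> tags
The number of lines is always consistent
The line spacing is always consistent

then a simple approach like this will do: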

my_text = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

lines = my_text.split("\n")

title   = lines[4]
journal = lines[6]
author  = lines[8]
date    = lines[10]

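If the line spacing cannot be guaranteed, but you can guarantee that you only need the first four non-blank lines inside <html><pre>, then something like this works: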

import pprint

max_extracted_lines = 4
extracted_lines = []
for line in lines:
    if line == "<html>" or line == "<pre>":
        continue
    if line:
        extracted_lines.append(line)
    if len(extracted_lines) >= max_extracted_lines:
        break

pprint.pprint(extracted_lines)

This gives the output:

['A Short Study of Notation Efficiency',
 'CACM August, 1960',
 'Smith Jr., H. J.',
 'CA600802 JB March 20, 1978  9:02 PM']

Don't use regular expressions where simple string manipulation will do.
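Since the question mentions several pages in the same format, the loop can be wrapped in a function and reused per page. This is only a sketch: the function name `extract_fields` is hypothetical, the field names are borrowed from the first snippet in this answer, and it still assumes the first four non-blank lines after the tags are the wanted fields.

```python
FIELD_NAMES = ("title", "journal", "author", "date")


def extract_fields(page_text):
    """Map the first four non-blank, non-tag lines of a page to named fields."""
    extracted = []
    for line in page_text.split("\n"):
        stripped = line.strip()
        # Skip blank lines and the two opening tags.
        if not stripped or stripped in ("<html>", "<pre>"):
            continue
        extracted.append(stripped)
        if len(extracted) >= len(FIELD_NAMES):
            break
    return dict(zip(FIELD_NAMES, extracted))


# Abbreviated version of the sample page, with uneven blank lines on purpose.
page = """<html>
<pre>

A Short Study of Notation Efficiency
CACM August, 1960



Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164

</pre>
</html>"""

record = extract_fields(page)
print(record["title"])
print(record["date"])
```

Calling `extract_fields` once per page text then yields one dictionary per page.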
