python - Multiline regex - how to grab a part of the page source

Question

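Sorry if this has been asked before, but I find the Python regex documentation hard to understand, mainly because it lacks examples. I want to grab a block of the page source so that I can parse it again later. For example: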

    <div id="viewed"><div class="shortstory-block">

    <div class="shortstoey-block-image">
        <a href="...."><img src="/uploads/posts/cov.jpg" alt="instance 1"/></a>
        <span class="format"><a href="http://www..../">something</a></span>
    </div>

    <a href="http://....."><span class="shortstory-block-title" style="text-decoration:none !important;">
        Something
    </span>
    </a>

</div><div class="shortstory-block">

    <div class="shortstoey-block-image">
        <a href="...."><img src="/uploads/posts/cov.jpg" alt="something 2"/></a>
        <span class="format"><a href="http://www.website/xfsearch/smth/">something</a></span>
    </div>

    <a href="http://web.html"><span class="shortstory-block-title" style="text-decoration:none !important;">
        Something
    </span>
    </a>
 </div>
  (* x times)
     <div id="rated">....

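I have the whole page source in a variable (html_source), and I want to define another variable that holds only this block (between div id="viewed" and div id="rated"). I want to grab everything between the two, even though any number of \n or \r characters may appear in between.

Can someone point me in the right direction (regex)?

Thanks in advance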

score 2 · Accepted Answer

If you really just want to find whatever lies between two pieces of text, you can use the following regular expression:

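    import re

    # Read the whole page source into one string.
    with open('yourfile') as fin:
        page_source = fin.read()

    # Escape the markers so they are matched literally.
    start_text = re.escape('<div id="viewed">')
    until_text = re.escape('<div id="rated">')
    # Non-greedy capture; DOTALL lets "." span newlines.
    match_text = re.search('{}(.*?){}'.format(start_text, until_text), page_source, flags=re.DOTALL)
    if match_text:
        print(match_text.group(1))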
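
To see why the pattern uses the non-greedy `.*?`, here is a minimal sketch. The `extract_between` helper and the `sample` string are hypothetical, made up for illustration only; they are not part of the original answer. The sample deliberately contains the end marker twice.

```python
import re

def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    pattern = re.escape(start) + '(.*?)' + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None

# Made-up sample in which the end marker occurs twice.
sample = 'A<div id="viewed">\none\n<div id="rated">two<div id="rated">B'

# Non-greedy: stops at the first end marker after the start marker.
block = extract_between(sample, '<div id="viewed">', '<div id="rated">')
print(repr(block))

# Greedy ".*" for comparison: runs on to the last end marker.
greedy = re.search(
    re.escape('<div id="viewed">') + '(.*)' + re.escape('<div id="rated">'),
    sample,
    flags=re.DOTALL,
)
print(repr(greedy.group(1)))
```

If the page could contain more than one `<div id="rated">`, the non-greedy form is probably the one you want, since it keeps the captured block as short as possible.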

score 1 · Accepted Answer

The re.DOTALL flag makes . match any character. Without that flag, it will not match newlines.

(DOTALL can also be spelled (?s) inside the regex itself.)
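
A small sketch of the difference, assuming a made-up `html_source` string that stands in for the real page (the sample content is hypothetical):

```python
import re

# Hypothetical sample standing in for html_source from the question.
html_source = '<div id="viewed">\n  <p>first</p>\r\n  <p>second</p>\n<div id="rated">'

pattern = re.escape('<div id="viewed">') + '(.*?)' + re.escape('<div id="rated">')

# Without DOTALL, "." cannot cross a "\n", so the search finds nothing.
no_flag = re.search(pattern, html_source)

# With DOTALL, "." also matches "\n", so the whole block is captured.
with_flag = re.search(pattern, html_source, flags=re.DOTALL)

# The inline form: (?s) at the very start of the pattern does the same.
inline = re.search('(?s)' + pattern, html_source)

print(no_flag)
print(repr(with_flag.group(1)))
print(inline.group(1) == with_flag.group(1))
```

Note that `\r` is matched by `.` even without the flag; only `\n` is excluded by default.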

For a similar question, with code examples and a better way of doing this, see: Python's "re" module not working?
