python - How do I parse an HTML document and extract a specific element in Python?

Question

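There are many XML and HTML parsers in Python, and I am looking for a simple way to extract a section of an HTML document, preferably using an XPath construct, though that is optional.

Here is an example: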

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"

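I want to extract the entire element with id=content, so the result should be: <div id=content>AAA<B>BBB</B>CCC</div>

Ideally, I could do this without installing a new library.

I would also prefer to get the raw content of the desired element (not reformatted).

Regular expressions are not allowed, as they are not safe for parsing XML/HTML.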

score 1 · Accepted Answer

Use a library for parsing; the best option is BeautifulSoup. Here is a snippet showing how it would work for you:

# BeautifulSoup 4 is imported from the bs4 package (BeautifulSoup 3 only runs on Python 2).
from bs4 import BeautifulSoup

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
soupy = BeautifulSoup(src, "html.parser")

content_divs = soupy.find_all(attrs={"id": "content"})
if content_divs:
    # print the first one
    print(str(content_divs[0]))

    # to print the text contents
    print(content_divs[0].text)

    # or to print the HTML of every match (as re-serialized by BeautifulSoup)
    for each in content_divs:
        print(each)
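Since the question asks to avoid installing a new library and to keep the original markup, here is a minimal standard-library sketch of one possible alternative. It is not part of the answer above. The class name `ElementById` and the helper `extract_by_id` are hypothetical names chosen for illustration, and the code assumes the closing tag contains no `>` before its end.

```python
from html.parser import HTMLParser


class ElementById(HTMLParser):
    """Record the source offsets of the first element with a given id."""

    def __init__(self, src, target_id):
        super().__init__()
        self.src = src
        self.target_id = target_id
        # Absolute offset of the start of each line, used to convert getpos().
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(src) if ch == "\n"]
        self.tag = None    # tag name of the matched element
        self.depth = 0     # nesting depth of same-named tags inside the match
        self.start = None  # offset of the opening tag
        self.end = None    # offset just past the closing tag

    def _offset(self):
        # getpos() returns (1-based line, column) of the tag being handled.
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if self.end is not None:
            return
        if self.start is None:
            if dict(attrs).get("id") == self.target_id:
                self.start, self.tag, self.depth = self._offset(), tag, 1
        elif tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.start is None or self.end is not None or tag != self.tag:
            return
        self.depth -= 1
        if self.depth == 0:
            # The closing tag ends at its first ">".
            self.end = self.src.index(">", self._offset()) + 1


def extract_by_id(src, target_id):
    """Return the unmodified source text of the element, or None if absent."""
    parser = ElementById(src, target_id)
    parser.feed(src)
    parser.close()
    if parser.start is None or parser.end is None:
        return None
    return src[parser.start:parser.end]


src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
print(extract_by_id(src, "content"))
```

Because the result is a slice of the input string, the tag case and the unquoted attribute are left exactly as written, which a re-serializing parser would not guarantee.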

score 0 · Accepted Answer

Yes, I have done this. It may not be the best approach, but it works along the lines of the code below. I have not tested it.

import re

# src is the HTML string from the question.
# re.search returns a single match object (re.finditer returns an iterator with no .start()).
match = re.search("<div id=content>", src)
src = src[match.start():]

# At this point the string starts with your div; everything preceding it has been stripped.
# The next part works because the first </div> in the string closes your div section.
match = re.search("</div>", src)
src = src[:match.end()]

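src now holds only the div from the string. If the content you want can contain another div, you just need to build a more advanced search pattern for the regex search step.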
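For the nested case, one possible reading of "a more advanced search" is to count opening and closing div tags until the target div is balanced. The sketch below is only an illustration: the function name `extract_div` is hypothetical, it assumes lowercase `div` tags, and it remains as fragile as any regex-based approach (comments or attribute values containing `>` will confuse it).

```python
import re


def extract_div(src, open_pattern=r"<div id=content>"):
    """Return the target div including nested divs, or None if not found."""
    first = re.search(open_pattern, src)
    if first is None:
        return None
    rest = src[first.start():]
    depth = 0
    # Walk every opening or closing div tag, starting with the target itself.
    for tag in re.finditer(r"<(/?)div\b[^>]*>", rest):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            # The target div is balanced; cut just past this closing tag.
            return rest[:tag.end()]
    return None  # the target div was never closed


nested = "<html><body>...<div id=content>AAA<div>x</div>CCC</div>...</body></html>"
print(extract_div(nested))
```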
