html-parsing - BeautifulSoup: parsing only part of a page

Question

I want to parse a fragment of an HTML page, for example:

my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
   <a href="#">Link1</a>
   <a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""

I pass this string to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template

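However, while parsing, BeautifulSoup adds <html>, <head>, and <body> tags (when using the lxml or html5lib parsers), and I do not need them in my code. So far, the only way I have found to avoid this is to use html.parser.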
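For reference, a minimal sketch of the html.parser workaround mentioned above, assuming bs4 is installed. The helper name add_nofollow is hypothetical and is not part of the original question.

```python
from bs4 import BeautifulSoup


def add_nofollow(fragment):
    """Return the fragment with rel="nofollow" set on every <a> tag."""
    # html.parser does not wrap a fragment in <html>/<head>/<body>,
    # so str(soup) serializes only the markup that was passed in.
    soup = BeautifulSoup(fragment, "html.parser")
    for link in soup.find_all("a"):
        link["rel"] = "nofollow"
    return str(soup)
```

Calling add_nofollow(my_string) should give back the original fragment with the extra attribute on both links and no wrapper tags.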

I would like to know whether there is a way to get rid of the redundant tags while still using lxml, the fastest parser.

Update

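My question was originally worded incorrectly. I have now removed the <div> wrapper from the example, because ordinary users do not type this tag. For that reason, we cannot use the .extract() method to get rid of the <html>, <head>, and <body> tags.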

score 1 · Accepted Answer

Use

soup.body.renderContents()

answered 2012-12-05T09:22:00.170
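A sketch of the same idea for bs4, where renderContents() is a legacy name that returns bytes and decode_contents() returns a str. The helper name body_inner_html and the fallback for parsers that add no <body> are assumptions for illustration, and the default "lxml" parser requires the lxml package.

```python
from bs4 import BeautifulSoup


def body_inner_html(markup, parser="lxml"):
    """Serialize only what sits inside <body>, without the wrapper tags."""
    soup = BeautifulSoup(markup, parser)
    if soup.body is None:
        # e.g. html.parser given a bare fragment: nothing to strip
        return str(soup)
    # decode_contents() renders the children of <body> as one str
    return soup.body.decode_contents()
```

This keeps lxml as the parser and discards the wrapper only at serialization time, so any edits made to the tree beforehand (such as adding rel="nofollow") are preserved.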

score 0 · Accepted Answer

lxml will always add these tags, but you can use Tag.extract() to pull the <div> tag out of them:

comment = soup.body.div.extract()
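As a rough sketch of how that might look end to end, assuming the fragment is wrapped in a single <div> (which the updated question says is no longer the case). The helper name extract_wrapper is hypothetical, and it uses find("div") rather than soup.body.div so that it also works with parsers that add no <body>.

```python
from bs4 import BeautifulSoup


def extract_wrapper(markup, parser="lxml"):
    """Detach and return the first <div>, or None if there is none."""
    soup = BeautifulSoup(markup, parser)
    wrapper = soup.find("div")
    if wrapper is None:
        return None
    # extract() removes the tag from the tree and returns it
    return wrapper.extract()
```

str() on the returned tag gives the <div> and its contents without the surrounding <html> and <body>.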

score 0 · Accepted Answer

I was able to solve the problem using the .contents attribute:

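try:
    children = soup.body.contents
    string = ''
    for child in children:
        # serialize each child node (Tag or NavigableString)
        string += str(child)
    return string
except AttributeError:
    # soup.body is None when the parser added no <body> wrapper
    return str(soup)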

I thought ''.join(soup.body.contents) would be a more concise way to turn the list into a string, but that does not work; I got

TypeError: sequence item 0: expected string, Tag found
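The join fails because .contents holds Tag and NavigableString objects rather than plain strings. One possible fix, sketched here with a hypothetical helper name join_children, is to convert each child with str() before joining:

```python
from bs4 import BeautifulSoup


def join_children(tag):
    """Concatenate the string form of every direct child of tag."""
    # str() on a Tag serializes it to markup; text nodes are already str-like
    return "".join(str(child) for child in tag.contents)
```

In bs4, tag.decode_contents() is likely the more robust choice, since it applies the output formatter to text nodes as well.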
