python - Python Beautiful Soup .contents attribute

Question

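What does BeautifulSoup's .contents do? I am working through the crummy.com tutorial, but I don't really understand what .contents does. I have looked through the forums and have not found an answer. Take a look at the code below...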

from BeautifulSoup import BeautifulSoup
import re



doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
        '</html>']

soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[0].contents[0].contents[0].name

I expected the last line of the code to print 'body', but instead I get...

  File "pe_ratio.py", line 29, in <module>
    print soup.contents[0].contents[0].contents[0].contents[0].name
  File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'

Does .contents only work with html, head, and title? If so, why?

Thanks in advance for your help.

score 3 · Accepted Answer

It simply gives you a list of what is directly inside a tag. Let me demonstrate with an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

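from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, "html.parser")
head = soup.head

# .contents is the list of the tag's direct children
print(head.contents)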

The code above gives me a list, [<title>The Dormouse's story</title>], because that is what sits inside the head tag. Indexing it with [0] gives you the first item in that list.
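To address whether this is limited to html, head, and title: it is not. As a minimal sketch, assuming bs4 with the built-in "html.parser" on Python 3 (the question itself uses the older BeautifulSoup 3 on Python 2.7), the snippet below looks at the children of a p tag taken from the question's markup. The closing </p> is added here for the illustration.

```python
from bs4 import BeautifulSoup

# One paragraph from the question, with a closing </p> added for this sketch
html = '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
soup = BeautifulSoup(html, "html.parser")

# .contents lists direct children: text nodes and tags, in document order
children = soup.p.contents
for child in children:
    print(type(child).__name__, repr(str(child)))
```

The list mixes text nodes and tags: the text before the b tag, the b tag itself, and the trailing period.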

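The reason you get the error is that soup.contents[0].contents[0].contents[0].contents[0] returns something that is not a tag at all, so it has no tag attributes such as name. With your code it returns the text Page title: the first contents[0] gives you the html tag, the second gives you the head tag, the third gives you the title tag, and the fourth gives you the actual text inside it, which is a NavigableString. So when you ask for name on it, there is no tag name to give you.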
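The descent described above can be traced one step at a time. This is a sketch under the assumption of bs4 with "html.parser" on Python 3 rather than the BeautifulSoup 3 setup in the question; the `steps` list is only there for illustration.

```python
from bs4 import BeautifulSoup, Tag

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc), "html.parser")

node = soup
steps = []
for _ in range(4):
    # Follow the first child, exactly as each .contents[0] in the question does
    node = node.contents[0]
    if isinstance(node, Tag):
        steps.append(("tag", node.name))
    else:
        # Text nodes are strings, not tags, so there is no tag name to report
        steps.append(("string", str(node)))

for kind, label in steps:
    print(kind, label)
```

The first three steps land on tags (html, head, title); the fourth lands on the text inside title.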

If you want to print the body, you can do the following:

soup = BeautifulSoup(''.join(doc))
print(soup.body)

If you want to reach body using only contents, use the following:

soup = BeautifulSoup(''.join(doc))
print(soup.contents[0].contents[1].name)

You do not use [0] as the index here, because body is the second element of html, after head.
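One caveat with fixed indices: .contents also includes text nodes, so whitespace between tags shifts the positions. The sketch below assumes bs4 with "html.parser" on Python 3, and a variant of the document with a newline between head and body; `child_tags` is a hypothetical helper written for this example, not part of the library.

```python
from bs4 import BeautifulSoup, Tag

# Assumed variant of the document: note the newline between </head> and <body>
html = "<html><head><title>Page title</title></head>\n<body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
root = soup.contents[0]  # the html tag

def child_tags(tag):
    """Hypothetical helper: keep only the direct children that are tags."""
    return [child for child in tag.contents if isinstance(child, Tag)]

names = [t.name for t in child_tags(root)]
print(names)

# Index 1 is now the newline text node, not body
print(repr(root.contents[1]))
```

Filtering on Tag, or simply using soup.body as shown earlier, avoids depending on how the markup happens to be spaced.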
