python - Beautiful Soup does not find string

Question

While parsing http://en.wikipedia.org/wiki/Israel, I ran into an H2 tag that contains text, yet Beautiful Soup returns a None type for it:

$ python
Python 2.7.3 (default, Apr 10 2013, 05:13:16)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> import requests
>>> from pprint import pprint
>>> response = requests.get('http://en.wikipedia.org/wiki/Israel')
>>> soup = bs4.BeautifulSoup(response.content)
>>> for h in soup.find_all('h2'):
...     pprint(str(type(h)))
...     pprint(h)
...     pprint(str(type(h.string)))
...     pprint(h.string)
...     print('--')
...                     
"<class 'bs4.element.Tag'>"
<h2>Contents</h2>    
"<class 'bs4.element.NavigableString'>"
u'Contents'          
--                   
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>
"<type 'NoneType'>"  
None                 
--                   
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="History">History</span></h2>
"<class 'bs4.element.NavigableString'>"
u'History'           
--

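Note that this is not a parsing problem; Beautiful Soup parses the document just fine. Why does the second H2 element return a None type? Is it because of the leading " " (space) in the string? How can I work around this? This is Beautiful Soup 4 on Python 2.7, Kubuntu Linux 12.10.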

score 2 · Accepted Answer

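Let me answer the first half first: what is going on.

Quoting the bs4 documentation: "If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None."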
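The rule quoted above can be reproduced on the two headings from the question. This is a minimal sketch, assuming Python 3 with bs4 installed and the standard-library "html.parser" backend (the question used Python 2.7 and the default parser); the markup is copied from the question's output, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Two headings copied from the output shown in the question.
html = (
    '<h2><span class="mw-headline" id="History">History</span></h2>'
    '<h2><span class="mw-headline" id="Etymology">'
    '<span id="Etymology"></span> Etymology</span></h2>'
)

# Assumption: "html.parser" is used here; the question did not name a parser.
soup = BeautifulSoup(html, "html.parser")
history, etymology = soup.find_all("h2")

# Single-child chain (h2 -> span -> text), so .string resolves to the text.
history_string = history.string

# The outer span holds two things (an empty span and a text node),
# so .string is ambiguous and is defined as None.
etymology_children = len(etymology.span.contents)
etymology_string = etymology.string

print(repr(history_string), etymology_children, repr(etymology_string))
```

So the leading space is not the cause by itself; what matters is that the outer span has more than one child.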

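Now the other half: how to fix it.

Quoting the same source again: "If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator." Better still, use the .stripped_strings generator and join the results; I think that will give you what you want.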
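A sketch of that suggestion, under the same assumptions as before (Python 3, bs4 installed, "html.parser" backend). The helper name heading_text is hypothetical and not part of bs4; joining with a single space is one possible choice of separator.

```python
from bs4 import BeautifulSoup

# The problematic heading from the question.
html = (
    '<h2><span class="mw-headline" id="Etymology">'
    '<span id="Etymology"></span> Etymology</span></h2>'
)
h2 = BeautifulSoup(html, "html.parser").h2


def heading_text(tag):
    """Hypothetical helper: join every non-empty, stripped string under `tag`."""
    return " ".join(tag.stripped_strings)


# .strings yields the text nodes as they are, leading space included.
raw_pieces = list(h2.strings)

# .stripped_strings strips whitespace and skips whitespace-only strings.
text = heading_text(h2)

print(raw_pieces, repr(text))
```

In the loop from the question, calling such a helper on each h in place of h.string would avoid the None result for headings with nested tags.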

score 1 · Accepted Answer

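I think this is because the second h2 has no text of its own; instead it has a span as its child (and that span has another span as its child, which makes the inner span a grandchild of the h2).

For this kind of parsing, use the generator-based attributes such as .stripped_strings and .strings.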

>>> s.find_all('h2')
[<h2>Contents</h2>, <h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>]
>>> list(s.find_all('h2')[-1].stripped_strings)
[u'Etymology']
