python - Parsing an HTML page with BeautifulSoup

Question

I have started looking into BeautifulSoup for parsing HTML.
For example, for the site "http://en.wikipedia.org/wiki/PLCB1":

import sys
import urllib2
from BeautifulSoup import BeautifulSoup

sys.setrecursionlimit(10000)

site= "http://en.wikipedia.org/wiki/PLCB1"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find('table', {'class':'infobox'})
#print table
rows = table.findAll("th")
for x in rows:
    print "x - ", x.string

For some of the header cells that contain a URL, the output is None. Why does this happen?

Output:

x -  Phospholipase C, beta 1 (phosphoinositide-specific)
x -  Identifiers
x -  None
x -  External IDs
x -  None
x -  None
x -  Molecular function
x -  Cellular component
x -  Biological process
x -  RNA expression pattern
x -  Orthologs
x -  Species
x -  None
x -  None
x -  None
x -  RefSeq (mRNA)
x -  RefSeq (protein)
x -  Location (UCSC)
x -  None

For example, after Location there is one more header that contains "pubmed search", but it is shown as None. I would like to know why.

Second: is there a way to get each th and its corresponding td into a dictionary, to make parsing easier?

score 5 · Accepted Answer

Element.string only holds a value when the element directly contains text; text inside nested elements is not included.

If you are using BeautifulSoup 4, use Element.stripped_strings instead:

print ''.join(x.stripped_strings)

For BeautifulSoup 3, you need to search for all text elements:

print ''.join([unicode(t).strip() for t in x.findAll(text=True)])
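
As a rough illustration of the BeautifulSoup 4 variant, here is a minimal sketch in Python 3. It assumes the `bs4` package is installed; the HTML fragment is a hypothetical, simplified stand-in for the infobox, not the real page. Joining with a space instead of an empty string is a choice made here so that "RefSeq" and "(mRNA)" do not run together.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified header cells modeled on the infobox markup.
html = """
<table class="infobox">
  <tr><th>External IDs</th></tr>
  <tr><th><a href="/wiki/RefSeq">RefSeq</a> (mRNA)</th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headers = soup.find("table", {"class": "infobox"}).find_all("th")

# .string is None when a cell mixes a nested tag with loose text.
direct = [th.string for th in headers]

# stripped_strings yields the text from every nesting level, whitespace-trimmed.
joined = [" ".join(th.stripped_strings) for th in headers]

print(direct)
print(joined)
```

The second cell has no usable `.string`, but its text is still recoverable through `stripped_strings`.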

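If you want to combine the <th> and <td> elements into a dictionary, loop over all the <th> elements, use .findNextSibling() to locate the corresponding <td> element, and combine that with the .findAll(text=True) trick above to build the dictionary: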

info = {}
rows = table.findAll("th")
for headercell in rows:
    valuecell = headercell.findNextSibling('td')
    if valuecell is None:
        continue
    header = ''.join([unicode(t).strip() for t in headercell.findAll(text=True)])
    value = ''.join([unicode(t).strip() for t in valuecell.findAll(text=True)])
    info[header] = value
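
The same idea can be sketched for BeautifulSoup 4 under Python 3, where the methods are spelled `find_all` and `find_next_sibling`. This assumes `bs4` is installed. The helper names `cell_text` and `infobox_to_dict`, the sample rows, and the placeholder value `NM_000000` are all hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical rows modeled on the infobox; the cell values are placeholders.
html = """
<table class="infobox">
  <tr><th colspan="2">Identifiers</th></tr>
  <tr>
    <th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th>
    <td><span><a href="http://www.genenames.org/">PLCB1</a></span></td>
  </tr>
  <tr>
    <th><a href="/wiki/RefSeq">RefSeq</a> (mRNA)</th>
    <td><a href="#">NM_000000</a></td>
  </tr>
</table>
"""


def cell_text(cell):
    """Join all text fragments inside a cell, nested tags included."""
    return " ".join(cell.stripped_strings)


def infobox_to_dict(table):
    """Map each <th> to the text of the <td> that follows it in the same row."""
    info = {}
    for header in table.find_all("th"):
        value = header.find_next_sibling("td")
        if value is None:
            # Section headers such as "Identifiers" have no value cell.
            continue
        info[cell_text(header)] = cell_text(value)
    return info


soup = BeautifulSoup(html, "html.parser")
info = infobox_to_dict(soup.find("table", {"class": "infobox"}))
print(info)
```

Headers that span the whole row are skipped, so only rows with a real header and value pair end up in the dictionary.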

score 2 · Accepted Answer

If you inspect the HTML,

<th colspan="4" style="text-align:center; background-color: #ddd">Identifiers</th>
</tr>
<tr class="">
<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th>
<td colspan="3" class="" style="background-color: #eee"><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.genenames.org/data/hgnc_data.php?hgnc_id=15917">PLCB1</a>; EIEE12; PI-PLC; PLC-154; PLC-I; PLC154; PLCB1A; PLCB1B</span></td>
</tr>
<tr class="">
<th style="background-color: #c3fdb8">External IDs</th>

you will see that between Identifiers and External IDs there is a <th> tag with no text of its own, only an <a> tag:

<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th>

This <th> has no text of its own, so x.string is None.
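
To check which shape of cell triggers None on a given setup, a small probe like the following may help. It is a sketch for BeautifulSoup 4 (`bs4`, assumed installed) with hypothetical cells. One caveat: BeautifulSoup 4 follows a single nested tag, so the link-only cell returns its text there, whereas the question's output shows None for that cell under BeautifulSoup 3. Cells that mix a tag with loose text give None in this sketch as well.

```python
from bs4 import BeautifulSoup

# Hypothetical header cells covering three shapes of content.
html = (
    "<table>"
    "<tr><th>External IDs</th></tr>"
    '<tr><th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th></tr>'
    '<tr><th><a href="/wiki/RefSeq">RefSeq</a> (mRNA)</th></tr>'
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
plain, link_only, mixed = soup.find_all("th")

print(repr(plain.string))      # text sits directly inside the <th>
print(repr(link_only.string))  # bs4 descends into the single nested <a>
print(repr(mixed.string))      # a tag plus loose text has no single string
print(len(mixed.contents))     # number of direct child nodes in the mixed cell
```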
