python - Matching a specific table in HTML with BeautifulSoup

Question

I have this problem: the page I'm trying to scrape contains several similar tables.

<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">     
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">

The only difference between them is the text in the h2 tag, here: Points

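How can I specify which table I need to search in?

I have this code, and I need to adjust it so that it takes the h2 tag into account: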

my_tab = soup.find('table', {'class':'tabelle_grafik lh'})

Any help would be appreciated, guys.

score 3 · Accepted Answer

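This worked for me. Walk through each table's previous siblings: if you reach an h2 with the text "Points" before you reach an h2 with any other text, you have found the right table.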

from bs4 import BeautifulSoup

t = """
<h2 class="tabellen_ueberschrift al">Points</h2>
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>yes me!</td></tr></table>
<h2 class="tabellen_ueberschrift al">Bad</h2>
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>woo woo</td></tr></table>
"""

soup = BeautifulSoup(t, 'html.parser')

for ta in soup.find_all('table'):
    # Walk backwards through the tags that precede this table at the same level.
    for s in ta.find_previous_siblings():
        if s.name == 'h2':
            # Only the nearest preceding h2 decides whether the table matches.
            if s.text == 'Points':
                print(ta)
            break
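Note that in the question's markup the table sits inside a div, so the h2 is a sibling of that div rather than of the table itself. A minimal sketch for that layout, assuming BeautifulSoup 4 (bs4) is installed: it looks up the h2 by its text and then takes the next matching table in document order. The helper name table_after_heading and the sample cell contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question: each table is wrapped in a div.
HTML = """
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>yes me!</td></tr></table>
</div>
<h2 class="tabellen_ueberschrift al">Bad</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">
<tr><td>woo woo</td></tr></table>
</div>
"""


def table_after_heading(soup, heading_text):
    """Return the first 'tabelle_grafik' table after the h2 with this text, or None."""
    heading = soup.find("h2", string=heading_text)
    if heading is None:
        return None
    # find_next searches forward in document order, so it also looks inside the div.
    return heading.find_next("table", class_="tabelle_grafik")


soup = BeautifulSoup(HTML, "html.parser")
my_tab = table_after_heading(soup, "Points")
print(my_tab.get_text(strip=True))  # prints: yes me!
```

One caveat: if the "Points" section had no table at all, find_next would carry on and return the table from a later section, so check the result if that can happen on your page.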

score 1 · Accepted Answer

This looks like a job for XPath. However, BeautifulSoup does not support XPath expressions.

Consider switching to lxml or scrapy.

For reference, given test XML such as:

<html>
<h2 class="tabellen_ueberschrift al">Points</h2>  
<div class="fl" style="width:49%;">   
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">a</table>
</div>

<h2 class="tabellen_ueberschrift al">Illegal</h2>
<div class="fl" style="width:49%;">     
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1">b</table>
</div>
</html>

The XPath expression that finds the table with class "tabelle_grafik lh" inside the div that follows the h2 containing "Points" is:

//table[@class="tabelle_grafik lh" and ../preceding-sibling::h2[1][text()="Points"]]
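For what it's worth, here is a rough sketch of running that expression with lxml, assuming lxml is installed. The sample document mirrors the test markup above, except that the cell text is wrapped in tr/td so the tables are valid HTML; the variable names are just for illustration.

```python
from lxml import html

DOC = """
<html>
<h2 class="tabellen_ueberschrift al">Points</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1"><tr><td>a</td></tr></table>
</div>

<h2 class="tabellen_ueberschrift al">Illegal</h2>
<div class="fl" style="width:49%;">
<table class="tabelle_grafik lh" cellpadding="2" cellspacing="1"><tr><td>b</td></tr></table>
</div>
</html>
"""

# ".." steps up to the wrapping div; preceding-sibling::h2[1] is the nearest h2 before it.
XPATH = ('//table[@class="tabelle_grafik lh" and '
         '../preceding-sibling::h2[1][text()="Points"]]')

tree = html.fromstring(DOC)
tables = tree.xpath(XPATH)  # list of matching <table> elements

for table in tables:
    print(table.text_content().strip())
```

The @class comparison is an exact string match, so it only works while the attribute is written exactly as "tabelle_grafik lh".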
