python - Extracting table contents from HTML using Python and BeautifulSoup

Question

I want to extract certain information from an HTML document. For example, it contains a table (among other tables with other content) like this:

    <table class="details">
            <tr>
                    <th>Advisory:</th>
                    <td>RHBA-2013:0947-1</td>
            </tr>
            <tr>    
                    <th>Type:</th>
                    <td>Bug Fix Advisory</td>
            </tr>
            <tr>
                    <th>Severity:</th>
                    <td>N/A</td>
            </tr>
            <tr>    
                    <th>Issued on:</th>
                    <td>2013-06-13</td>
            </tr>
            <tr>    
                    <th>Last updated on:</th>
                    <td>2013-06-13</td>
            </tr>

            <tr>
                    <th valign="top">Affected Products:</th>
                    <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
            </tr>


    </table>

I want to extract information such as the date in "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I can't get it right. My code so far:

    from bs4 import BeautifulSoup
    soup=BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    table_tag=soup.table
    if table_tag['class'] == ['details']:
            print table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text()
            a=table_tag.next_sibling
            print  unicode(a)
            print table_tag.contents

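This gets me the contents of the first table row, plus a listing of the table's contents. But the `next_sibling` part does not work as expected; I guess I am just using it wrong. Of course I could parse the contents myself, but it seems to me that Beautiful Soup exists to save us from exactly that (if I start parsing by hand, I might as well parse the whole document...). If someone could show me how to accomplish this, I would be grateful. If there is a better way than BeautifulSoup, I would be interested to hear about it.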

score 26 · Accepted Answer

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    >>> table = soup.find('table', {'class': 'details'})
    >>> th = table.find('th', text='Issued on:')
    >>> th
    <th>Issued on:</th>
    >>> td = th.findNext('td')
    >>> td
    <td>2013-06-13</td>
    >>> td.text
    u'2013-06-13'
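
The same lookup can be generalized to collect every label/value pair of the table at once. The following is only a minimal sketch for Python 3, not part of the original answer: the `html` sample is a trimmed copy of the table from the question with an extra dummy table in front, `extract_details` is a hypothetical helper name, and the stdlib `"html.parser"` backend is an assumption (any parser supported by bs4 should behave the same here).

```python
from bs4 import BeautifulSoup

# Sample data for illustration: a trimmed copy of the question's table,
# preceded by an unrelated table that must be skipped.
html = """
<table class="other"><tr><th>Ignore:</th><td>me</td></tr></table>
<table class="details">
  <tr>
    <th>Advisory:</th>
    <td>RHBA-2013:0947-1</td>
  </tr>
  <tr>
    <th>Issued on:</th>
    <td>2013-06-13</td>
  </tr>
  <tr>
    <th valign="top">Affected Products:</th>
    <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
  </tr>
</table>
"""


def extract_details(markup):
    """Return {label: value} for each row of the first table with class "details"."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", class_="details")
    if table is None:
        return {}
    details = {}
    for row in table.find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        if th is None or td is None:
            continue  # skip rows that lack a header cell or a data cell
        # Strip surrounding whitespace and the trailing colon from the label.
        label = th.get_text(strip=True).rstrip(":")
        details[label] = td.get_text(strip=True)
    return details


details = extract_details(html)
print(details["Issued on"])  # prints 2013-06-13
```

Iterating over the `tr` tags with `find_all` avoids `.next_sibling` altogether. In markup formatted like the question's, `.next_sibling` typically returns the whitespace string between two tags rather than the next tag, which is probably why the original attempt looked broken; `find_next_sibling("tr")` skips over such text nodes.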
