I am trying to get the data from a table whose specific ID I know. For some reason, the code keeps giving me a None result.

From the HTML I am trying to parse:

<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
    <tr class="gridHeader" valign="top">
        <td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td>
        <td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td>
        <td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td>
        <td class="titleGridReg" align="center" valign="top">שער בסיס</td>
        <td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span></td>
        <td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
    </tr>
    <tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">

... and so on

My code:

html = br.response().read()
soup = BeautifulSoup(html)

table = soup.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")
rows = table.findAll(lambda tag: tag.name=='tr')

In [100]: print table
None

2 Answers


From the documentation:

table = soup.find('table', id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")

And for the rows:

rows = table.findAll('tr')
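
As a minimal sketch of the lookup above, assuming Python 3 with bs4 installed (the question appears to use Python 2), the snippet below runs against a made-up inline table with a shortened, hypothetical id `DataGrid1` instead of the real page. `find_all` is the bs4 spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page and its long id.
html = """
<table id="DataGrid1">
  <tr class="gridHeader"><td>Date</td><td>Close</td></tr>
  <tr><td>2013-10-24</td><td>101.5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Look the table up by its id attribute; find() returns None when nothing matches.
table = soup.find("table", id="DataGrid1")

rows = []
if table is not None:
    for tr in table.find_all("tr"):
        # Collect the stripped text of every cell in this row.
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

Checking for `None` before calling `find_all` avoids an AttributeError when the id is not present in the parsed document.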

For the encoding problem, try decoding from UTF-8 and then re-encoding:

html = br.response().read().decode('utf-8')
soup = BeautifulSoup(html.encode('utf-8'))
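
A possible alternative, sketched here under the assumption of Python 3 and bs4, is to hand the raw bytes to BeautifulSoup and name the encoding through its `from_encoding` argument. The `raw` bytes below are a made-up stand-in for what `br.response().read()` would return.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for br.response().read(): UTF-8 bytes containing
# the Hebrew word U+05E9 U+05E2 U+05E8 inside a table cell.
raw = "<table id=\"t\"><tr><td>\u05e9\u05e2\u05e8</td></tr></table>".encode("utf-8")

# from_encoding tells bs4 how to decode the bytes instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

cell_text = soup.find("td").get_text()
print(repr(cell_text))
```

Whether this helps with the original page is untested; it only shows that bs4 can decode the bytes itself when told the encoding.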
answered 2013-10-25T14:06:19.317

Improving on aiKid's answer:

# coding=utf-8
from bs4 import BeautifulSoup

html = u"""
<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
                            <tr class="gridHeader" valign="top">
                                <td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td><td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td><td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td><td class="titleGridReg" align="center" valign="top">שער בסיס</td><td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span>
</td><td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
                            </tr><tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">
"""

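# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")
# print(...) with a single argument works on both Python 2 and Python 3.
print(soup.find_all("table",
                    id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1"))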

Since you are working with UTF-8 data, you need to make the string a Unicode string, as in u"""(...)""". To get a Unicode string from the response, all you need to do is:

br.response().read().decode('utf-8')

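The above gives you a Unicode string (not an ASCII one), which you can later encode back to UTF-8 bytes. For example, assuming the string is stored in html, you can use html.encode("utf-8"). If you do that, you do not need to put u in front of anything, and in Python 2 you can treat everything as regular byte strings again.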
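
To make the direction of the two calls concrete, here is a small sketch using only the standard library, written for Python 3. The `raw` value is a hypothetical stand-in for the bytes returned by `br.response().read()`, assumed to be UTF-8.

```python
# Hypothetical stand-in for the response body: the Hebrew word
# U+05E9 U+05E2 U+05E8, encoded as UTF-8 bytes.
raw = "\u05e9\u05e2\u05e8".encode("utf-8")

# decode: bytes -> Unicode text (str on Python 3, unicode on Python 2).
text = raw.decode("utf-8")

# encode: Unicode text -> bytes again.
round_trip = text.encode("utf-8")

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
print(round_trip == raw)
```

Each of these Hebrew letters takes two bytes in UTF-8, so the byte string is twice as long as the decoded text.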

answered 2013-10-25T14:17:55.080