python - Python code to get the HTML data of a table present in the source page

Question

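I am new to Python and I am trying to scrape a website. I can log in to the site and fetch an HTML page, but I don't need the whole page; I only need the hyperlinks inside one specific table.

I have written the following code, but it fetches every hyperlink on the page.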

soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'} ):
        for link in soup.findAll('a'):
                print link.get('href')

Can anyone help me see where I am going wrong?

Below is the HTML of the table:

<table id="ctl00_Main_lvMyAccount_Table1" width="680px">
 <tr id="ctl00_Main_lvMyAccount_Tr1">
    <td id="ctl00_Main_lvMyAccount_Td1">
                        <table id="ctl00_Main_lvMyAccount_itemPlaceholderContainer" border="1" cellspacing="0" cellpadding="3">
        <tr id="ctl00_Main_lvMyAccount_Tr2" style="background-color:#0090dd;">
            <th id="ctl00_Main_lvMyAccount_Th1"></th>
            <th id="ctl00_Main_lvMyAccount_Th2">

                                    <a id="ctl00_Main_lvMyAccount_SortByAcctNum" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')">
                                        <font color=white>
                                            <span id="ctl00_Main_lvMyAccount_AcctNum">Account number</span>
                                        </font>

                                        </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th4">
                                    <a id="ctl00_Main_lvMyAccount_SortByServAdd" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_ServiceAddress">Service address</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th5">
                                    <a id="ctl00_Main_lvMyAccount_SortByAcctName" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_AcctName">Name</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th6">
                                    <a id="ctl00_Main_lvMyAccount_SortByStatus" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_AcctStatus">Account status</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th3"></th>
        </tr>


            <tr>
                <td>

Thanks in advance.

score 1 · Accepted Answer

Well, here is the correct way to do it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(the_page)
for table in soup.findAll('table', {'id': 'ctl00_Main_lvMyAccount_Table1'}):
    for link in table.findAll('a'):  # search for links only in the table
        print(link['href'])  # get the href attribute

Also, you can skip the outer loop, because only one element matches the specified id:

soup = BeautifulSoup(the_page)
table = soup.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
for link in table.findAll('a'):  # search for links only in the table
    print(link['href'])  # get the href attribute

Update: took note of what @DSM said and fixed the missing quote in the table assignment.
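One thing the snippets above do not cover: find() returns None when no table matches, and an anchor without an href raises a KeyError on link['href']. Here is a minimal sketch that guards against both, assuming bs4 is installed. The HTML string and the helper name table_links are made up for illustration and are not from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page in the question.
HTML = """
<a href="/home">Home</a>
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><td><a href="/account/1">Account 1</a></td></tr>
  <tr><td><a>no href here</a></td></tr>
</table>
"""


def table_links(html, table_id):
    """Return the href values of links inside the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:  # find() returns None when nothing matches
        return []
    # href=True keeps only <a> tags that actually carry an href attribute
    return [a["href"] for a in table.find_all("a", href=True)]


links = table_links(HTML, "ctl00_Main_lvMyAccount_Table1")
print(links)
```

If the id is misspelled or the table is missing from the response (for example because the login failed), the helper returns an empty list instead of raising an AttributeError.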

score 0 · Accepted Answer

Make sure your for loop searches the table's HTML (rather than the soup variable, which holds the HTML of the whole page):

from bs4 import BeautifulSoup

page = BeautifulSoup(the_page)
table = page.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
links = table.findAll('a')

# Print the href of each link found inside the table
for link in links:
    print(link['href'])

Result

In [8]: table = page.find('table', {'id' : 'ctl00_Main_lvMyAccount_Table1'})

In [9]: links = table.findAll('a')

In [10]: for link in links:
   ....:     print link['href']
   ....:     
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')
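Note that the values printed above are javascript: pseudo-URLs that call __doPostBack with two quoted arguments, not addresses you can fetch directly. If you need the arguments themselves, a regular expression is one option. This is only a sketch based on the four hrefs shown in the output; POSTBACK_RE and parse_postback are hypothetical names, and other pages may format these calls differently.

```python
import re

# Matches hrefs of the form: javascript:__doPostBack('<target>','<argument>')
POSTBACK_RE = re.compile(r"^javascript:__doPostBack\('([^']*)','([^']*)'\)$")


def parse_postback(href):
    """Return (target, argument) for a __doPostBack href, or None if it does not match."""
    match = POSTBACK_RE.match(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


href = "javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')"
result = parse_postback(href)
print(result)
```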

score 0 · Accepted Answer

Your inner loop iterates over soup.findAll('a'), which searches the entire HTML page. To search for links only inside the table, change that line to:

for link in table.findAll('a'):
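To see the difference side by side, here is a small self-contained sketch, assuming bs4 is installed. The HTML is an invented stand-in for the real page, with one link inside the table and two outside it.

```python
from bs4 import BeautifulSoup

# Invented sample page: only the middle link lives inside the table.
HTML = """
<a href="/logout">Log out</a>
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><td><a href="/account/1">Account 1</a></td></tr>
</table>
<a href="/help">Help</a>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"id": "ctl00_Main_lvMyAccount_Table1"})

# Searching from soup walks the whole document.
page_links = [a.get("href") for a in soup.find_all("a")]

# Searching from table only walks that table's descendants.
scoped_links = [a.get("href") for a in table.find_all("a")]

print(page_links)
print(scoped_links)
```

The first list contains all three links and the second only the one inside the table, which is the behaviour the question was after.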
