python - Parsing with BeautifulSoup in Python when the link is dynamic

Question

I am trying to parse the table information listed on this site:

https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry

This is the code I am using:

link = re.findall(re.compile('<a href="(.*?)">'), str(row))
link = 'https://www.theice.com'+link[0]
print link #Double check if link is correct
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent':user_agent}
req = urllib2.Request(link, headers = headers)
try:
    pg = urllib2.urlopen(req).read()
    page = BeautifulSoup(pg)
except urllib2.HTTPError, e:
    print 'Error:', e.code, '\n', '\n'

table = page.find('table', attrs = {'class':'default'})
tr_odd = table.findAll('tr', attrs = {'class':'odd'})
tr_even = table.findAll('tr', attrs = {'class':'even'})
print tr_odd, tr_even

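For some reason, at the urllib2.urlopen(req).read() step the link changes, i.e. link no longer contains the same URL as the one given above. As a result, my program opens a different page, and the variable page stores the information from that different page. Consequently, my tr_odd and tr_even variables are NULL.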

What could be the reason the link changes? Is there another way to access the contents of this page? All I need are the table values.

score 1 · Accepted Answer

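The information on this page is supplied by a JavaScript function. When you download the page with urllib, you get the page before the JavaScript has executed. When you view the page manually in a standard browser, you see the HTML after the JavaScript has executed.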

To get the data programmatically, you need a tool that can execute JavaScript. There are a number of third-party options available for Python, such as selenium, WebKit, or spidermonkey.

Here is an example of how to scrape the page using selenium (with phantomjs) and lxml:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
link = 'https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry'

with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()')
    print('\n'.join(map(str, zip(*[iter(tds)]*5))))

yields

('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13')
('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')
('Sep13', '2/11/13', '9/27/13', '9/27/13', '9/27/13')
('Oct13', '2/11/13', '10/25/13', '10/25/13', '10/25/13')
...
('Aug18', '2/11/13', '8/31/18', '8/31/18', '8/31/18')
('Sep18', '2/11/13', '9/28/18', '9/28/18', '9/28/18')
('Oct18', '2/11/13', '10/26/18', '10/26/18', '10/26/18')
('Nov18', '2/11/13', '11/30/18', '11/30/18', '11/30/18')
('Dec18', '2/11/13', '12/28/18', '12/28/18', '12/28/18')

Explanation of the XPath:

lxml allows you to select tags using XPath. The XPath

'//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()'

means

//table    # search recursively for <table>
  [@class="default"]  # with an attribute class="default"
  //tr     # and find inside <table> all <tr> tags
    [@class="odd" or @class="even"]   # that have attribute class="odd" or class="even"
    /td      # find the <td> tags which are direct children of the <tr> tags  
      /text()  # return the text inside the <td> tag
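
As a rough illustration of that breakdown, here is a minimal sketch that walks the same steps with the standard library's xml.etree.ElementTree instead of lxml, so it runs without a browser or third-party packages. The SAMPLE markup is made up for the example (it is not the real page), and ElementTree's limited XPath support has no "or", so the class test is done in Python rather than inside the path.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the rendered page (an assumption,
# not the real markup of the site).
SAMPLE = """
<html><body>
  <table class="other"><tr class="odd"><td>ignored</td></tr></table>
  <table class="default">
    <tr><th>Contract</th><th>First Trade</th></tr>
    <tr class="odd"><td>Jul13</td><td>2/11/13</td></tr>
    <tr class="even"><td>Aug13</td><td>2/11/13</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

tds = []
# //table[@class="default"]//tr : every <tr> anywhere inside a matching table
for tr in root.findall(".//table[@class='default']//tr"):
    # [@class="odd" or @class="even"] : filtered in Python, since
    # ElementTree paths do not support "or"
    if tr.get('class') in ('odd', 'even'):
        # /td/text() : text of the <td> tags that are direct children
        for td in tr.findall('td'):
            tds.append(td.text)

print(tds)  # ['Jul13', '2/11/13', 'Aug13', '2/11/13']
```

The header row (no class) and the table with class="other" are both skipped, which is what the two predicates in the XPath are for. Note that td.text only returns the text before any nested tag, whereas lxml's text() returns every direct text node.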

Explanation of zip(*[iter(tds)]*5):

tds is a list. It looks like

['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13', 'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13',...]

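Note that each row of the table contains 5 items, but our list is flat. So, to group every 5 items into a tuple, we can use the grouper recipe. zip(*[iter(tds)]*5) is an application of the grouper recipe. It takes a flat list, like tds, and turns it into a list of tuples (an iterator of tuples in Python 3) with every 5 items grouped together.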

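Here is an explanation of how the grouper recipe works. Please read it, and if you have any questions about it, I'll be happy to try to answer them.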
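
For a quick feel of the idiom, here is a small sketch (not the linked explanation itself) that applies it to the first two rows of the tds list shown above. The helper name group_rows is made up for the example.

```python
# Flat list of cell texts: the first two rows of the tds example above.
tds = ['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13',
       'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13']

def group_rows(flat, n):
    """Group a flat sequence into tuples of n items (hypothetical helper)."""
    it = iter(flat)  # one shared iterator over the data
    # [it] * n is a list holding the SAME iterator n times. zip() pulls one
    # item from each entry in turn, so every output tuple consumes n
    # consecutive items from the flat sequence.
    return list(zip(*[it] * n))

rows = group_rows(tds, 5)
for row in rows:
    print(row)
# ('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13')
# ('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')
```

One thing to keep in mind: zip stops as soon as the shared iterator runs out, so trailing items that do not fill a complete group are silently dropped. For example, group_rows([1, 2, 3, 4, 5, 6, 7], 3) gives [(1, 2, 3), (4, 5, 6)].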

To get only the first column of the table, change the XPath to:

tds = doc.xpath(
    '''//table[@class="default"]
         //tr[@class="odd" or @class="even"]
           /td[1]/text()''')
print(tds)

For example,

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
link = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=6753474#expiry'
with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '''//table[@class="default"]
             //tr[@class="odd" or @class="even"]
               /td[1]/text()''')
    print(tds)

yields

['Jul13', 'Aug13', 'Sep13', 'Oct13', 'Nov13', 'Dec13', 'Jan14', 'Feb14', 'Mar14', 'Apr14', 'May14', 'Jun14', 'Jul14', 'Aug14', 'Sep14', 'Oct14', 'Nov14', 'Dec14', 'Jan15', 'Feb15', 'Mar15', 'Apr15', 'May15', 'Jun15', 'Jul15', 'Aug15', 'Sep15', 'Oct15', 'Nov15', 'Dec15']

score 0 · Accepted Answer

I don't think the link actually changes.

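Anyway, the problem is that your regex is wrong. If you take the link it prints out and paste it into a browser, you get a blank page, or the wrong page, or a redirect to the wrong page. Python will download exactly the same thing.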

Here is the link from the actual page:

<a href="/productguide/MarginRates.shtml;jsessionid=B53D8EF107AAC5F37F0ADF627B843B58?index=&amp;specId=19118104" class="marginrates"></a>

And here is what your regex finds:

/productguide/MarginRates.shtml;jsessionid=B53D8EF107AAC5F37F0ADF627B843B58?index=&amp;specId=19118104

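Notice the &amp; there? You need to decode it to &, or your URL is wrong. Instead of a query-string variable specId with the value 19118104, you have a query-string variable named amp;specId (although technically you can't have an unescaped semicolon like that either, so everything from jsession onward is a fragment).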

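You'll notice that if you paste the first one into a browser, you get a blank page. Remove the extra amp; and you get the right page (after a redirect). The same is true in Python.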
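
One way to do that decoding in code, sketched here in Python 3 (the question itself uses Python 2), is html.unescape from the standard library. The helper query_names is made up for this example and only exists to show which parameter names the server would see before and after decoding.

```python
from html import unescape
from urllib.parse import urlsplit

# The href exactly as the regex captured it from the HTML source.
raw_href = ('/productguide/MarginRates.shtml;'
            'jsessionid=B53D8EF107AAC5F37F0ADF627B843B58'
            '?index=&amp;specId=19118104')

def query_names(url):
    """Hypothetical helper: list the query-string parameter names in url."""
    query = urlsplit(url).query
    return [pair.split('=', 1)[0] for pair in query.split('&') if pair]

href = unescape(raw_href)                # turns "&amp;" back into "&"
link = 'https://www.theice.com' + href   # same prefix the question uses

print(query_names(raw_href))  # ['index', 'amp;specId']
print(query_names(link))      # ['index', 'specId']
```

An HTML parser that exposes attribute values usually does this decoding for you, which is one reason to read the href from the parsed tag rather than running a regex over the raw markup.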
