python - 使用 dryscrape 和 BeautifulSoup 进行网页抓取

Question

我正在尝试从雅虎抓取一些数据。我写了一个有效的脚本 - 有时。有时，当我运行脚本时，我能够下载完整的页面 - 其他时候，页面只是部分加载 - 缺少数据部分。

更令人困惑的是，当我在浏览器中导航到该页面时，会显示整个页面。

这是我的代码的要点：

import dryscrape
from bs4 import BeautifulSoup

url =  'http://finance.yahoo.com/quote/SPY/options?p=SPY&straddle=false'

sess = dryscrape.Session()

sess.set_header('user-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0')

sess.set_attribute('auto_load_images', False)          
sess.set_timeout(360)

sess.visit(url)

soup = BeautifulSoup(sess.body(), 'lxml')

# Related to memory leak issue in webkit
sess.reset()

# Barfs (sometimes!) at the line below
sel_list = soup.find('select', class_='Fz(s)')

if sel_list is None or len(sel_list) == 0:
    print('element not found on page!')

我附上了下面获取的页面的图像。这是通过网络浏览器在互联网上查看时的网页：

现在，这是我通过类似于上面显示的脚本拉下的页面 - 它没有数据！：

谁能弄清楚为什么当我的脚本获取数据时元素有时会丢失？同样（更多？）重要的是，我该如何解决这个问题？

score 5 · Accepted Answer

您可能需要等待数据加载后才能使用BeautifulSoup. dryscrape等待可以通过函数wait_for()完成：

sess.visit(url)

# waiting for the first data row in a table to be present
sess.wait_for(lambda: session.at_css("tr.data-row0"))

soup = BeautifulSoup(sess.body(), 'lxml')

或者，在黑暗中拍摄：它也可能是一个临时（网络？）问题，您可以通过循环刷新页面直到看到结果来解决它，类似于以下内容：

from dryscrape.mixins import WaitTimeoutError 

ATTEMPTS_COUNT = 5
attempts = 0

while attempts <= ATTEMPTS_COUNT:
    sess.visit(url)

    try:
        # waiting for the first data row in a table to be present
        sess.wait_for(lambda: session.at_css("tr.data-row0"))
        break
    except WaitTimeoutError:
        print("Data row has not appeared, retrying...")
        attempts += 1

soup = BeautifulSoup(sess.body(), 'lxml')

python - 使用 dryscrape 和 BeautifulSoup 进行网页抓取

1 回答 1

Related

Reference