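I am trying to create a basic web scraper for Amazon results. As I iterate through the results, I sometimes get to page 5 of the results (sometimes only page 2), and then a StaleElementException is thrown. When I look at the browser after the exception is thrown, I can see that the driver/page did not scroll down to where the page numbers are (the bottom bar).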

My code:

driver.get('https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=sonicare+toothbrush')

for page in range(1,last_page_number +1):

    driver.implicitly_wait(10)

    bottom_bar = driver.find_element_by_class_name('pagnCur')
    driver.execute_script("arguments[0].scrollIntoView(true);", bottom_bar)

    current_page_number = int(driver.find_element_by_class_name('pagnCur').text)

    if page == current_page_number:
        next_page = driver.find_element_by_xpath('//div[@id="pagn"]/span[@class="pagnLink"]/a[text()="{0}"]'.format(current_page_number+1))
        next_page.click()
        print('page #',page,': going to next page')
    else:
        print('page #: ', page,'error')

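I have looked at this question, and I am guessing a similar fix could be applied, but I am not sure how to find something on the page that disappears. Also, judging by how fast the print statements appear, I can see that implicitly_wait(10) is not actually waiting the full 10 seconds.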

The exception points to the line that starts with "driver.execute_script". Here is the exception:

StaleElementReferenceException: Message: The element reference of <span class="pagnCur"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

Sometimes I get a ValueError:

ValueError: invalid literal for int() with base 10: ''

So these errors/exceptions lead me to believe that the problem has to do with waiting for the page to fully refresh.

2 Answers

This error message...

StaleElementReferenceException: Message: The element reference of <span class="pagnCur"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

...implies that the previous reference to the element is now stale and the element it pointed to is no longer present in the page's DOM.

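The common causes behind this issue are:

  • The element's position within the HTML has changed.
  • The element is no longer attached to the DOM tree.
  • The web page containing the element was refreshed.
  • The previous instance of the element was refreshed by a JavaScript or Ajax call.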
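
Whatever the cause, a common way to cope is to locate the element again each time it is used and to retry when the reference turns out to be stale. The following is a minimal, library-agnostic sketch of that idea, not Selenium code: `retry_on_stale`, `StaleReference`, and `read_current_page` are hypothetical names, and `StaleReference` only stands in for Selenium's `StaleElementReferenceException` so that the snippet runs without a browser.

```python
import time


class StaleReference(Exception):
    """Stand-in for selenium.common.exceptions.StaleElementReferenceException."""


def retry_on_stale(action, exceptions=(StaleReference,), attempts=3, delay=0.0):
    """Call action() and retry when it raises one of `exceptions`.

    `action` should locate the element *and* use it, so every retry
    works with a fresh reference instead of the stale one.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


# Simulated page: the first two lookups hit a node that was re-rendered.
calls = {"count": 0}


def read_current_page():
    calls["count"] += 1
    if calls["count"] < 3:
        raise StaleReference("element was re-rendered")
    return 5


page_number = retry_on_stale(read_current_page)
print(page_number, calls["count"])
```

With Selenium, `action` would be a function that calls `driver.find_element(...)` and reads the element's `.text` in one go, and `exceptions` would be `(StaleElementReferenceException,)`.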

This use case

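Keeping your approach of scrolling with scrollIntoView() and printing some useful debug messages, I made a few small adjustments to introduce WebDriverWait. You can use the following solution: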

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = Options()
    options.add_argument("start-maximized")
    options.add_argument('disable-infobars')
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=sonicare+toothbrush")
    current_page_number = "unknown"  # fallback for the message if the very first wait fails
    while True:
        try:
            # Wait for the current-page marker to be visible, then scroll it into view
            current_page_number_element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagnCur")))
            driver.execute_script("arguments[0].scrollIntoView(true);", current_page_number_element)
            current_page_number = current_page_number_element.get_attribute("innerHTML")
            # Click the "next" arrow once it is clickable
            WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span.pagnNextArrow"))).click()
            print("page # {} : going to next page".format(current_page_number))
        except Exception:
            # Any failure (typically a timeout on the last page) ends the loop
            print("page # {} : error, no more pages".format(current_page_number))
            break
    driver.quit()
    
  • Console output:

    page # 1 : going to next page
    page # 2 : going to next page
    page # 3 : going to next page
    page # 4 : going to next page
    page # 5 : going to next page
    page # 6 : going to next page
    page # 7 : going to next page
    page # 8 : going to next page
    page # 9 : going to next page
    page # 10 : going to next page
    page # 11 : going to next page
    page # 12 : going to next page
    page # 13 : going to next page
    page # 14 : going to next page
    page # 15 : going to next page
    page # 16 : going to next page
    page # 17 : going to next page
    page # 18 : going to next page
    page # 19 : going to next page
    page # 20 : error, no more pages
    
answered 2018-12-06T06:55:15.427

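If you just want your script to iterate over all result pages, you don't need any complex logic. Simply click the Next button for as long as that is possible: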

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()

driver.get('https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=sonicare+toothbrush')

while True:
    try:
        wait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a > span#pagnNextString'))).click()
    except TimeoutException:
        break

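P.S. Also note that implicitly_wait(10) is not supposed to wait a full 10 seconds. It waits up to 10 seconds for the element to appear in the HTML DOM. So if the element is found within 1 or 2 seconds, the wait is over and you will not wait for the remaining 8-9 seconds.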
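
To illustrate the "up to" semantics, here is a rough sketch of a polling wait that returns as soon as its condition holds and only raises once the timeout has elapsed. It is not Selenium's actual implementation; `wait_until` and its parameters are hypothetical names, and the "element" is simulated with a timestamp so the snippet runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.1):
    """Poll condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # return immediately, without using up the timeout
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll_interval)


# Simulated element that "appears" about 0.3 s after the wait starts.
appears_at = time.monotonic() + 0.3
start = time.monotonic()
found = wait_until(lambda: time.monotonic() >= appears_at, timeout=10.0)
elapsed = time.monotonic() - start
print("found:", found, "returned before the 10 s timeout:", elapsed < 10.0)
```

The call returns after roughly 0.3 seconds even though the timeout is 10 seconds, which matches the fast print statements seen in the question.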

answered 2018-12-05T22:09:51.920