python-2.7 - Selenium + Geckodriver troubleshooting

Question

I am using the Firefox gecko driver with selenium in Python to scrape forum post titles, and I have run into a problem I can't seem to figure out.

~$ geckodriver --version
geckodriver 0.19.0

The source code of this program is available from
testing/geckodriver in https://hg.mozilla.org/mozilla-central.

This program is subject to the terms of the Mozilla Public License 2.0.
You can obtain a copy of the license at https://mozilla.org/MPL/2.0/.

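I am trying to scrape post titles going back several years from a forum, and my code works fine for a while. I sat and watched it run for about 20-30 minutes and it did exactly what it was supposed to do. However, I then started the script and went to bed, and when I woke up the next morning I found it had processed about 22,000 posts. The site I am currently scraping has 25 posts per page, so it got through about 880 separate URLs before crashing.

When it crashes, it raises the following error: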

WebDriverException: Message: Tried to run command without establishing a connection

Originally my code looked like this:

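from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

FirefoxProfile = webdriver.FirefoxProfile('/home/me/jupyter-notebooks/FirefoxProfile/')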
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
driver.close()

I also tried:

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
    driver.close()

and

for url in urls:
    driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    driver.get(url)
    ### code to process page here ###
    driver.close()

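I got the same error in all 3 scenarios, but only after the script had run successfully for a while, and I don't know how to determine why it is failing.

How can I determine why I get this error after successfully processing several hundred urls? Or is there some best practice for processing this many pages with Selenium/Firefox that I am not following?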

score 0 · Accepted Answer

All 3 code blocks are nearly perfect, but they have the following minor flaws:

Your first code block is:

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
driver.close()

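This code block looks promising, with no real issues. As a last step, per best practices, we should call driver.quit() instead of driver.close(), which prevents dangling webdriver instances from residing in system memory. The difference between the two: driver.close() closes only the current browser window, while driver.quit() ends the whole WebDriver session and shuts down the driver process.

Your second code block is: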
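As a minimal sketch of the fix for the first code block, the loop can be wrapped in try/finally so that quit() runs even if a page raises an exception partway through. The names scrape_titles, make_driver and process_page are hypothetical helpers introduced here for illustration; they are not part of Selenium or of the original code.

```python
def scrape_titles(urls, make_driver, process_page):
    """Visit every URL with a single browser session and always shut it down.

    make_driver and process_page are hypothetical callables supplied by the
    caller: make_driver() returns a WebDriver-like object, and
    process_page(driver) extracts whatever is needed from the loaded page.
    """
    driver = make_driver()
    results = []
    try:
        for url in urls:
            driver.get(url)
            results.append(process_page(driver))
    finally:
        # quit() ends the WebDriver session and stops the driver process,
        # whereas close() only closes the current window.
        driver.quit()
    return results
```

With the setup from the question, make_driver could be `lambda: webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)`. Passing the driver factory in as an argument is only a design choice that keeps the sketch independent of any particular browser setup.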

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
    driver.close()

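This block is error prone. Once execution enters the for loop and finishes working on a url, we close the Browser Session/Instance. So when execution starts the second iteration of the loop, the script errors out on driver.get(url) because there is no active Browser Session.

Your third code block is: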
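The failure sequence of the second block can be replayed with a toy stand-in. This is only an illustrative model, assuming a browser with a single window; FakeBrowser and run_second_block are hypothetical names, and nothing here is Selenium code or reproduces its actual exception type.

```python
class FakeBrowser(object):
    """Toy stand-in for a WebDriver that has a single window (not Selenium)."""

    def __init__(self):
        self.session_active = True
        self.visited = []

    def get(self, url):
        if not self.session_active:
            raise RuntimeError("no active browser session")
        self.visited.append(url)

    def close(self):
        # Closing the only window leaves nothing left to drive.
        self.session_active = False


def run_second_block(urls):
    """Replay the control flow of the second code block on the toy driver."""
    driver = FakeBrowser()
    for url in urls:
        driver.get(url)   # raises on the second iteration
        driver.close()    # closes the session after the first URL
    return driver.visited
```

Calling run_second_block with two or more URLs raises on the second get(), which mirrors why calling close() inside the loop cannot work with a driver created once outside it.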

for url in urls:
    driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    driver.get(url)
    ### code to process page here ###
    driver.close()

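This code block looks almost fine, with the same issue as the first code block. As a last step we should call driver.quit() instead of driver.close(), which prevents dangling webdriver instances from residing in system memory. Because the dangling webdriver instances pile up and keep occupying ports, at some point WebDriver is unable to find a free port or to open a new Browser Session/Connection. Hence you see the error WebDriverException: Message: Tried to run command without establishing a connection

Solution:

Per best practices, call driver.quit() instead of driver.close(), then open a new WebDriver instance and a new Web Browser Session.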
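For the third pattern, where a new browser is started for every URL, a minimal sketch could look like the following. It assumes the same hypothetical helpers as before (make_driver returns a WebDriver-like object, process_page reads the loaded page); scrape_with_fresh_sessions is likewise a name made up for this example.

```python
def scrape_with_fresh_sessions(urls, make_driver, process_page):
    """Open a new browser session per URL and quit it before the next one.

    make_driver and process_page are hypothetical callables supplied by the
    caller; they are not part of Selenium.
    """
    results = []
    for url in urls:
        driver = make_driver()
        try:
            driver.get(url)
            results.append(process_page(driver))
        finally:
            # Quit inside the loop so no driver instance is left dangling
            # (and holding a port) when the next session is started.
            driver.quit()
    return results
```

Starting a browser per page is slower than reusing one session, so this variant mainly makes sense when each page needs a clean session.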
