javascript - Python Selenium fails to get through the links. Pastebin crawling

Question

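Hello, I am trying to extract all the links from the 10 result pages I get for the search term ssh.

I can extract the first 10 links from the first page, and once the JavaScript has loaded I can click once to move to the next page and extract the next 10 links. However, when I try to go to the third page, I get an error.

Here is my code: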

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import re

links = []
driver = webdriver.Firefox()
driver.get("http://pastebin.com/search?q=ssh")

# wait for the search results to be loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-result-info")))
for link in driver.find_elements_by_xpath("//div[@class='gs-title']/a[@class='gs-title']"):
    if link.get_attribute("href") != None:
        print link.get_attribute("href")
# get all search results links
for page in driver.find_elements_by_xpath("//div[@class='gsc-cursor-page']"):
    driver.implicitly_wait(10) # seconds
    page.click()

    for link in driver.find_elements_by_xpath("//div[@class='gs-title']/a[@class='gs-title']"):
        if link.get_attribute("href") != None:
            print link.get_attribute("href")

This is the output I am able to get, followed by the error:

python pastebinselenium.py 
http://pastebin.com/u/ssh
http://pastebin.com/gsQWBEZP
http://pastebin.com/gfA12TWk
http://pastebin.com/udWMWdPR
http://pastebin.com/J55238CB
http://pastebin.com/DN2aHvRr
http://pastebin.com/f0rh66kU
http://pastebin.com/3zvY3DSm
http://pastebin.com/fqHVJGEm
http://pastebin.com/3aB7h0fm
http://pastebin.com/3uBAxXu3
http://pastebin.com/cxjRqeSh
http://pastebin.com/5nJPNr3Q
http://pastebin.com/qV0rPNfP
http://pastebin.com/zubt2Yc7
http://pastebin.com/jFrjWYpE
http://pastebin.com/DU7yqjQ1
http://pastebin.com/AFtWHmtE
http://pastebin.com/UVP5behK
http://pastebin.com/hP7XTyv1
Traceback (most recent call last):
  File "pastebinselenium.py", line 21, in <module>
    page.click()
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 74, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webelement.py", line 457, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 233, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
Stacktrace:
    at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:9454)
    at Utils.getElementAt (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:9039)
    at fxdriver.preconditions.visible (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:10090)
    at DelayedCommand.prototype.checkPreconditions_ (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:12644)
    at DelayedCommand.prototype.executeInternal_/h (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
    at fxdriver.Timer.prototype.setTimeout/<.notify (file:///tmp/tmpzhZSEC/extensions/fxdriver@googlecode.com/components/command-processor.js:625)

I want to extract 10 links from each of the 10 pages (100 in total), but I can only extract 20 =(

I also tried this:

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-cursor-box")))

right before the click, but with no success.

score 3 · Accepted Answer

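The idea is to click the pagination links in a loop, waiting each time for the next page number to become the active one and collecting the links along the way. Implementation: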

from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://pastebin.com/search?q=ssh")

# wait for the search results to be loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-result-info")))

# CSS selector for the result title links on the current results page
RESULT_LINKS = ".gsc-results .gs-result > .gsc-thumbnail-inside > .gs-title > a.gs-title"

# collect the links from the first page
links = [link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, RESULT_LINKS)]
for page_number in range(2, 11):
    # look the pager button up again on every iteration so the reference is never stale
    driver.find_element(By.XPATH, "//div[@class='gsc-cursor-page' and . = '%d']" % page_number).click()

    # wait until the clicked page number is marked as the current page
    wait.until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@class, 'gsc-cursor-current-page') and . = '%d']" % page_number)))

    links.extend([link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, RESULT_LINKS)])

print(len(links))
pprint(links)

Prints:

100
['http://pastebin.com/u/ssh',
 'http://pastebin.com/gsQWBEZP',
  ...
 'http://pastebin.com/vtBgrndi',
 'http://pastebin.com/WgXrebLq',
 'http://pastebin.com/Nxui56Gh',
 'http://pastebin.com/Qef0LZPR',
 'http://pastebin.com/yNUh1fRe',
 'http://pastebin.com/2j0d8FzL',
 'http://pastebin.com/g92A2jAq']
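
To see why looking each pagination link up again matters, here is a small toy model in plain Python. It does not use Selenium at all: `FakePager`, `FakeButton` and `StaleReference` are hypothetical stand-ins, and the model simply assumes that the pager is re-rendered after every click, which is what the "perhaps the page has changed since it was looked up" message in the question suggests. It is only a sketch of the two looping patterns, not a description of how the search widget actually works.

```python
class StaleReference(Exception):
    """Stand-in for Selenium's StaleElementReferenceException (toy model only)."""


class FakePager:
    """Toy pager that re-renders all of its page buttons after every click."""

    def __init__(self, page_count):
        self.page_count = page_count
        self.current = 1
        self.generation = 0  # bumped each time the buttons are re-rendered

    def find_all(self):
        # Buttons for every page except the current one (the question's lookup).
        return [FakeButton(self, n, self.generation)
                for n in range(1, self.page_count + 1) if n != self.current]

    def find_by_number(self, number):
        # Fresh lookup of a single button by page number (the answer's lookup).
        return FakeButton(self, number, self.generation)


class FakeButton:
    def __init__(self, pager, number, generation):
        self.pager = pager
        self.number = number
        self.generation = generation  # render the button was found in

    def click(self):
        if self.generation != self.pager.generation:
            raise StaleReference("button was re-rendered since it was looked up")
        self.pager.current = self.number
        self.pager.generation += 1


def visit_with_cached_buttons(pager):
    """Question's pattern: look the buttons up once, then click each of them."""
    visited = [pager.current]
    for button in pager.find_all():
        button.click()
        visited.append(pager.current)
    return visited


def visit_with_fresh_lookups(pager):
    """Answer's pattern: look each button up again right before clicking it."""
    visited = [pager.current]
    for number in range(2, pager.page_count + 1):
        pager.find_by_number(number).click()
        visited.append(pager.current)
    return visited


if __name__ == "__main__":
    try:
        visit_with_cached_buttons(FakePager(10))
    except StaleReference as exc:
        print("cached:", exc)
    print("fresh:", visit_with_fresh_lookups(FakePager(10)))
```

In this toy model the cached loop gets as far as page 2 and then fails on the next click, which mirrors the 20 links followed by the exception in the question, while the fresh-lookup loop reaches all ten pages.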
