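I want to parse an IMDb movie rating list that spans about 8 pages. I am using Selenium for this, but I am having trouble clicking through to the next page. In the end I need 1,000 titles, which I will then process with BeautifulSoup. The code below does not work; I need to click the "NEXT" button in this HTML: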

<a class="flat-button lister-page-next next-page" href="/list/ls000004717/?page=2">
            Next
        </a>

Here is the code:

from selenium import webdriver as wb
browser = wb.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
field = browser.find_element_by_name("flat-button lister-page-next next-page").click()

The error is as follows:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".flat-button lister-page-next next-page"}
  (Session info: chrome=78.0.3904.108)

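I think I lack the necessary syntax knowledge, or I have mixed something up. I searched SO, but every example is quite specific and I do not know enough to generalize from those cases. Is there a way to handle this with Selenium?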

4 Answers

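You can try an XPath that matches the text inside the Next button. You should probably also use WebDriverWait, since you are navigating across multiple pages, and then scroll the button into view, because it sits at the bottom of the page: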

from selenium import webdriver as wb
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from time import sleep


browser = wb.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')

# keep clicking next until we reach the end
for i in range(0,9):

    # wait up to 10s before locating next button
    try:    
        next_button = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'page') and contains(text(), 'Next')]")))

        # scroll down to button using Javascript
        browser.execute_script("arguments[0].scrollIntoView(true);", next_button)

        # click via JavaScript; a plain next_button.click() threw an exception here
        browser.execute_script("arguments[0].click();", next_button)

        # I never recommend using sleep like this, but WebDriverWait is not waiting on next button to fully load, so it goes stale.
        sleep(5)

    # case: next button no longer exists, we have reached the end
    except TimeoutException:
        break

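I also wrapped everything in a try/except TimeoutException block to handle the case where we have reached the last page and the Next button no longer exists, which breaks out of the loop. This worked for me across multiple pages.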

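I also had to add an explicit sleep(5), because even after waiting with WebDriverWait on element_to_be_clickable, next_button was still throwing StaleElementReferenceException. It seems WebDriverWait finishes before the page has fully loaded, so the state of next_button changes after it has been located. Adding sleep(5) is normally bad practice, but there did not seem to be another workaround here. If anyone has suggestions, feel free to comment or edit the answer.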

Answered 2019-12-12T17:50:07.543

You can try a partial CSS selector.

browser.find_element_by_css_selector("a[class*='next-page']").click()
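
For context on why the original lookup failed: the error message shows the selector .flat-button lister-page-next next-page, and in CSS a space is a descendant combinator, so the second and third words are read as tag names rather than classes. A compound class selector joins the classes with dots instead. The helper below is a small illustrative sketch (class_attr_to_css is a hypothetical name, and it does no escaping, which is fine for simple class names like these):

```python
def class_attr_to_css(class_attr, tag=""):
    """Turn a space-separated HTML class attribute into a compound CSS selector."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class attribute is empty")
    # "a" + ".flat-button" + ".lister-page-next" + ".next-page"
    return tag + "".join("." + name for name in classes)


selector = class_attr_to_css("flat-button lister-page-next next-page", tag="a")
print(selector)  # a.flat-button.lister-page-next.next-page
```

Either this full selector or the partial a[class*='next-page'] can be passed to a CSS lookup. On Selenium 4.3 and later, where the find_element_by_* helpers were removed, the equivalent call is browser.find_element(By.CSS_SELECTOR, selector).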

Answered 2019-12-12T17:15:05.940

There are a few approaches that could work. The first is to use a selector for the Next button and loop until the end:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

browser = webdriver.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
selector = 'a[class*="next-page"]'

num_pages = 10
for page in range(num_pages):

    # Wait for the element to load
    WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.CSS_SELECTOR, selector)))
    # ... Do rating parsing here

    browser.find_element_by_css_selector(selector).click()

Instead of clicking the element, another option is to navigate to the next page directly with browser.get('...'):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

# Set up browser as before and navigate to the page
browser = webdriver.Chrome()
browser.get('https://www.imdb.com/list/ls000004717/')
selector = 'a[class*="next-page"]'
base_url = 'https://www.imdb.com/list/ls000004717/'
page_extension = '?page='

num_pages = 10  # assumed page count, matching the first example

# Already at page 1, so the loop only needs to run 9 times
for page in range(2, num_pages + 1):
    # Wait for the page to load
    WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.CSS_SELECTOR, selector)))
    # ... Do rating parsing here

    next_page = base_url + page_extension + str(page)
    browser.get(next_page)
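
The URL-based approach can also be sketched without hard-coding the page count. The example below assumes 1,000 titles at 100 per page (consistent with the "901 - 1,000 of 1,000" range mentioned elsewhere in this thread), assumes ?page=1 resolves to the first page, and assumes the base URL has no existing query string. The name page_urls is hypothetical.

```python
from math import ceil
from urllib.parse import urlencode


def page_urls(base_url, total_items, items_per_page):
    """Yield one URL per page, numbered from 1."""
    num_pages = ceil(total_items / items_per_page)
    for page in range(1, num_pages + 1):
        # Assumes base_url carries no query string of its own.
        yield f"{base_url}?{urlencode({'page': page})}"


urls = list(page_urls("https://www.imdb.com/list/ls000004717/", 1000, 100))
print(len(urls))  # 10
print(urls[1])    # https://www.imdb.com/list/ls000004717/?page=2
```

Each URL can then be passed to browser.get() in turn, which avoids locating and clicking the Next button altogether.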

Note: field = browser.find_element_by_name("...").click() will not assign a web element to field, because the click() method has no return value.

Answered 2019-12-12T18:01:43.627

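To click the element with the text NEXT until you reach the 901 - 1,000 of 1,000 page, you have to:

  • Call scrollIntoView() on the element once visibility_of_element_located() succeeds.
  • Induce WebDriverWait for element_to_be_clickable().
  • You can use the following solution:

    • Code block: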

      from selenium import webdriver
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.common.exceptions import TimeoutException
      
      options = webdriver.ChromeOptions() 
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
      driver.get('https://www.imdb.com/list/ls000004717/')
      driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagination-range"))))
      while True:
          try:
              WebDriverWait(driver, 20).until(EC.invisibility_of_element((By.CSS_SELECTOR, "div.row.text-center.lister-working.hidden")))
              driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span.pagination-range"))))
              WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.flat-button.lister-page-next.next-page"))).click()
              print("Clicked on NEXT button")
          except TimeoutException as e:
              print("No more NEXT button")
              break
      driver.quit()
      
    • Console output:

      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      Clicked on NEXT button
      No more NEXT button
      
Answered 2019-12-12T20:42:30.590