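I have a program that downloads photos from various web pages. Each URL is built by appending codes to the end of a root address, and the codes are read from a dataframe with 8,583 rows.

The pages rely on JavaScript, so I use selenium to get the photo's src. I then download it with urllib.request.urlretrieve.

Example of a photo page: http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/PB/150000608817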

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time
import urllib.request, urllib.parse, urllib.error

# Root URL of the site that is accessed to fetch the photo link
url_raiz = 'http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/'

# Accesses the dataframe that has the "sequencial" type codes
candidatos = pd.read_excel('candidatos_2018.xlsx',sheet_name='Sheet1', converters={'sequencial': lambda x: str(x), 'cpf': lambda x: str(x),'numero_urna': lambda x: str(x)})

# Function that opens each page and takes the link from the photo
def pegalink(url):
    profile = webdriver.FirefoxProfile()
    browser = webdriver.Firefox(profile)

    browser.get(url)
    time.sleep(10)

    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    browser.close()

    link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']

    return link

# Function that downloads the photo and saves it with the code name "cpf"
def baixa_foto(nome, url):
    urllib.request.urlretrieve(url, nome)


# Iteration in the dataframe
for num, row in candidatos.iterrows():
    cpf = (row['cpf']).strip()
    uf = (row['uf']).strip()
    print(cpf)
    print("-/-")
    sequencial = (row['sequencial']).strip()

    # Creates full page address
    url = url_raiz + uf + '/' + sequencial

    link_foto = pegalink(url)

    baixa_foto(cpf, link_foto)

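I am looking for guidance on:

  • Adding a try/except to wait for the page to load (I get an error when reading the src; after many requests the site takes more than ten seconds to load)

  • Logging every error, in a file or a dataframe: recording the "sequencial" code that caused the error and continuing the program

Does anyone know how to do this? The guides below were very helpful, but I could not get any further.

I put the data and part of the program in a folder, in case you want to take a look: https://drive.google.com/drive/folders/1lAnODBgC5ZUDINzGWMcvXKTzU7tVZXsj?usp=sharing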

1 Answer

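Wrap your code in a try/except block:

   try:
       # until() calls page_has_loaded(browser) repeatedly until it returns True
       WebDriverWait(browser, 30).until(page_has_loaded)
       # here goes your code
   except Exception:
       print("This is an unexpected condition!")

where page_has_loaded is the wait condition:

def page_has_loaded(driver):
    # WebDriverWait passes the driver in as the only argument
    page_state = driver.execute_script('return document.readyState;')
    return page_state == 'complete'

The 30 above is the timeout in seconds. You can adjust it as needed.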
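
On a JavaScript-rendered page, document.readyState can report 'complete' before the script has inserted the photo, so it may be safer to wait for the image itself. A minimal sketch, assuming the photo can be matched by the dvg-cand-foto class from the question; photo_src and wait_for_photo are hypothetical names, not part of selenium:

```python
CSS_SELECTOR = "css selector"  # the string value of selenium's By.CSS_SELECTOR


def photo_src(driver):
    """Wait condition: return the photo's src once it is rendered, else False."""
    # Assumption: the class name from the question identifies the photo <img>
    for img in driver.find_elements(CSS_SELECTOR, "img.dvg-cand-foto"):
        src = img.get_attribute("src")
        if src:
            return src
    return False


def wait_for_photo(driver, timeout=30):
    """Block until photo_src returns a value; raises TimeoutException otherwise."""
    # Imported here so photo_src can be tried out without selenium installed
    from selenium.webdriver.support.ui import WebDriverWait
    return WebDriverWait(driver, timeout).until(photo_src)
```

Calling wait_for_photo(browser) after browser.get(url) would then return the src directly, which could replace the fixed time.sleep(10) and the BeautifulSoup lookup.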

Method 2:

class wait_for_page_load(object):

    def __init__(self, browser):
        self.browser = browser

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        wait_for(self.page_has_loaded) 
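
The class above calls a wait_for helper that is not defined in this answer. A minimal sketch of one, assuming plain polling with a timeout; the timeout and poll_interval parameters and their defaults are assumptions:

```python
import time


def wait_for(condition_function, timeout=30, poll_interval=0.1):
    """Poll condition_function until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition_function():
            return True
        time.sleep(poll_interval)
    name = getattr(condition_function, "__name__", repr(condition_function))
    raise TimeoutError("Timed out waiting for " + name)
```

Raising on timeout, rather than returning False, lets the surrounding try/except catch a page that never loads.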


def pegalink(url):
    profile = webdriver.FirefoxProfile()
    browser = webdriver.Firefox(profile)

    try:
        # Navigate inside the context manager so it can detect the page change
        with wait_for_page_load(browser):
            browser.get(url)

        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']

    except Exception:
        print("This is an unexpected condition!")
        print("Erro em: ", url)
        link = "Erro"

    finally:
        # Close the browser even when an error occurred
        browser.close()

    return link
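
For the second point in the question, recording the failing "sequencial" codes and carrying on, one option is to write each failure to a CSV file inside the loop. A minimal sketch, assuming the fetch function raises on failure instead of returning "Erro"; process_candidates, fetch_link and the erros.csv file name are hypothetical:

```python
import csv


def process_candidates(records, fetch_link, log_path="erros.csv"):
    """Call fetch_link for each (sequencial, url) pair; log failures and continue."""
    links = {}
    with open(log_path, "w", newline="", encoding="utf-8") as log_file:
        writer = csv.writer(log_file)
        writer.writerow(["sequencial", "url", "erro"])
        for sequencial, url in records:
            try:
                links[sequencial] = fetch_link(url)
            except Exception as exc:
                # Record which code failed and why, then move to the next one
                writer.writerow([sequencial, url, repr(exc)])
    return links
```

The returned dictionary holds only the codes that succeeded, and the CSV can later be loaded with pd.read_csv to retry the failed ones.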
answered 2018-08-23T10:07:15.567