python - Downloading an image using Selenium in Python

Question

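I want to get a captcha image from the browser. I have the URL of this image, but the image changes every time it is refreshed (the URL stays the same).

Is there a way to get the image from the browser (like the "Save image as" button)?

Alternatively, I think the following should work:

Take a screenshot of the browser
Get the position of the image
Crop the captcha out of the screenshot using OpenCV

Link to the dynamic captcha - link

The problem was solved with a screenshot: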

import cv2
from selenium.webdriver.common.by import By

# `browser` is an already running WebDriver instance
browser.save_screenshot('screenshot.png')
img = browser.find_element(By.XPATH, '//*[@id="cryptogram"]')
loc = img.location

# Crop the 150x60 captcha out of the screenshot (rows are y, columns are x)
image = cv2.imread('screenshot.png')
x, y = int(loc['x']), int(loc['y'])
out = image[y:y + 60, x:x + 150]
cv2.imwrite('out.jpg', out)
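
The hard-coded 150x60 size and the raw `location` values assume that one CSS pixel equals one screenshot pixel. As a rough sketch, the crop rectangle can instead be computed from the element's `location` and `size` dictionaries. The helper name `crop_box` and its `scale` parameter are made up for illustration and are not part of Selenium; `scale` stands in for the device pixel ratio on high-DPI screens, where the screenshot can be larger than the page.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    `location` and `size` have the shape of Selenium's element.location
    and element.size dictionaries; `scale` is a hypothetical pixel ratio.
    """
    left = int(round(location['x'] * scale))
    top = int(round(location['y'] * scale))
    right = int(round((location['x'] + size['width']) * scale))
    bottom = int(round((location['y'] + size['height']) * scale))
    return left, top, right, bottom


# Example values matching the 150x60 captcha from the question
box = crop_box({'x': 10, 'y': 20}, {'width': 150, 'height': 60})
print(box)
```

With an OpenCV image the result would be applied as `image[top:bottom, left:right]`.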

Thanks

score 59 · Accepted Answer

Here is a complete example (using Google's reCAPTCHA as the target):

from urllib.request import urlretrieve
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://www.google.com/recaptcha/demo/recaptcha')

# get the image source
img = driver.find_element(By.XPATH, '//div[@id="recaptcha_image"]/img')
src = img.get_attribute('src')

# download the image (urllib.urlretrieve in Python 2)
urlretrieve(src, "captcha.png")

driver.close()

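Update:

The problem with dynamically generated images is that a new image is generated on every request. In that case, you have a few options:

Take a screenshot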

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://moscowsg.megafon.ru/ps/scc/php/cryptographp.php?PHPSESSID=mfc540jkbeme81qjvh5t0v0bnjdr7oc6&ref=114&w=150')

driver.save_screenshot("screenshot.png")

driver.close()

Simulate a right click + "Save as". See this thread for more information.
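
One further possibility, not part of the original answer and offered only as a sketch: because the browser has already decoded the image, it can be redrawn onto a canvas and read back without issuing a new request. The names `CANVAS_JS`, `data_url_to_bytes` and `grab_image_bytes` are hypothetical. The result is a PNG re-encoding rather than the original file, and the browser refuses to export the canvas if the image comes from another origin without CORS headers.

```python
import base64

# JavaScript that redraws an already loaded <img> onto a canvas and returns
# the canvas content as a base64 data URL.
CANVAS_JS = """
var img = arguments[0];
var canvas = document.createElement('canvas');
canvas.width = img.naturalWidth;
canvas.height = img.naturalHeight;
canvas.getContext('2d').drawImage(img, 0, 0);
return canvas.toDataURL('image/png');
"""


def data_url_to_bytes(data_url):
    """Decode a 'data:<mime>;base64,<payload>' URL into raw bytes."""
    header, sep, payload = data_url.partition(',')
    if not sep or not header.endswith(';base64'):
        raise ValueError('not a base64 data URL')
    return base64.b64decode(payload)


def grab_image_bytes(driver, img_element):
    """Return PNG bytes of an <img> element as currently displayed."""
    return data_url_to_bytes(driver.execute_script(CANVAS_JS, img_element))
```

Given a WebDriver `driver` and an `<img>` element `img`, the bytes returned by `grab_image_bytes(driver, img)` could be written to a file opened in `'wb'` mode.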

Hope that helps.

score 21 · Accepted Answer

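You can save a screenshot of the whole page and then crop the image, but you can also use one of the WebDriver "find" methods to locate the image you want to save and write out its "screenshot_as_png" property, as follows: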

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://www.webpagetest.org/')
with open('filename.png', 'wb') as file:
    file.write(driver.find_element(By.XPATH, '/html/body/div[1]/div[5]/div[2]/table[1]/tbody/tr/td[1]/a/div').screenshot_as_png)

Sometimes it may fail because of scrolling, but depending on your needs, this is a good way to get the image.
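
One way to reduce the scrolling issue might be to bring the element into the viewport before capturing it. The following is a minimal sketch under that assumption; the helper name `save_element_png` is hypothetical, and whether the extra scroll is needed depends on the browser and driver version.

```python
def save_element_png(driver, element, path):
    """Scroll `element` into view, then write its screenshot to `path`."""
    # scrollIntoView is a standard DOM method; 'center' keeps the element
    # away from sticky headers at the top of the page.
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
    with open(path, 'wb') as file:
        file.write(element.screenshot_as_png)
    return path
```

It would be called with the driver, the element returned by a "find" method, and a target file name such as 'filename.png'.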

score 3 · Accepted Answer

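The problem with using save_screenshot is that we cannot save the image at its original quality, nor can we recover the image's alpha channel. Therefore, I propose another solution. Here is a complete example using the selenium-wire library suggested by @codam_hsmits. The images can be downloaded through ChromeDriver.

I defined the following function to parse each request and, when necessary, save the response body to a file.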

from seleniumwire import webdriver  # Import from seleniumwire
from urllib.parse import urlparse
import os
from mimetypes import guess_extension
import time
import datetime

def download_assets(requests,
                   asset_dir="temp",
                   default_fname="unnamed",
                   skip_domains=["facebook", "google", "yahoo", "agkn", "2mdn"],
                   exts=[".png", ".jpeg", ".jpg", ".svg", ".gif", ".pdf", ".bmp", ".webp", ".ico"],
                   append_ext=False):
    asset_list = {}
    for req_idx, request in enumerate(requests):
        # request.headers
        # request.response.body is the raw response body in bytes
        if request is None or request.response is None or request.response.headers is None or 'Content-Type' not in request.response.headers:
            continue
            
        ext = guess_extension(request.response.headers['Content-Type'].split(';')[0].strip())
        if ext is None or ext == "" or ext not in exts:
            # Don't know the file extension, or not in the whitelist
            continue
        parsed_url = urlparse(request.url)
        
        skip = False
        for d in skip_domains:
            if d in parsed_url.netloc:
                skip = True
                break
        if skip:
            continue
        
        frelpath = parsed_url.path.strip()
        if frelpath == "":
            timestamp = str(datetime.datetime.now().replace(microsecond=0).isoformat())
            frelpath = f"{default_fname}_{req_idx}_{timestamp}{ext}"
        elif frelpath.endswith("\\") or frelpath.endswith("/"):
            timestamp = str(datetime.datetime.now().replace(microsecond=0).isoformat())
            frelpath = frelpath + f"{default_fname}_{req_idx}_{timestamp}{ext}"
        elif append_ext and not frelpath.endswith(ext):
            frelpath = frelpath + f"_{default_fname}{ext}" #Missing file extension but may not be a problem
        if frelpath.startswith("\\") or frelpath.startswith("/"):
            frelpath = frelpath[1:]
        
        fpath = os.path.join(asset_dir, parsed_url.netloc, frelpath)
        if os.path.isfile(fpath):
            continue
        os.makedirs(os.path.dirname(fpath), exist_ok=True)
        print(f"Downloading {request.url} to {fpath}")
        asset_list[fpath] = request.url
        try:
            with open(fpath, "wb") as file:
                file.write(request.response.body)
        except OSError:
            print(f"Cannot download {request.url} to {fpath}")
    return asset_list
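
The filtering step in the function above relies on turning the Content-Type header into a file extension. As a small standalone sketch of just that step (the names `wanted_extension` and `IMAGE_EXTS` are made up here for illustration):

```python
from mimetypes import guess_extension

# Image extensions taken from the whitelist in the function above
IMAGE_EXTS = {".png", ".jpeg", ".jpg", ".svg", ".gif", ".bmp", ".webp", ".ico"}


def wanted_extension(content_type, exts=IMAGE_EXTS):
    """Return the file extension for a Content-Type header, or None if unwanted."""
    # Drop parameters such as "; charset=utf-8" before the lookup
    mime = content_type.split(';')[0].strip()
    ext = guess_extension(mime)
    return ext if ext in exts else None


print(wanted_extension("image/png; charset=binary"))
print(wanted_extension("text/html; charset=utf-8"))
```

The exact extension returned for some MIME types can differ between Python versions, which is presumably why the whitelist lists both ".jpeg" and ".jpg".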

Let's download some images from the Google home page into the temp folder.

# Create a new instance of the Chrome/Firefox driver
driver = webdriver.Chrome()

# Go to the Google home page
driver.get('https://www.google.com')

# Download content to temp folder
asset_dir = "temp"

try:
    while True:
        # Browse in the opened window; new images are collected every second
        time.sleep(1)
        download_assets(driver.requests, asset_dir=asset_dir)
except KeyboardInterrupt:
    # Press Ctrl+C to stop collecting
    pass
finally:
    driver.close()

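Note that it cannot tell which images are visible on the page and which are hidden in the background, so the user should actively click buttons or links to trigger new download requests.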

score 0 · Accepted Answer

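Here you go.

Open the image with Selenium WebDriver
Extract the width and height of the image with BeautifulSoup
Set the current window size accordingly with driver.set_window_size, then take a screenshot with driver.save_screenshot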

from bs4 import BeautifulSoup
from selenium import webdriver
 
import os
from urllib.parse import urlparse
 
url = 'https://image.rakuten.co.jp/azu-kobe/cabinet/hair1/hb-30-pp1.jpg'
 
filename = os.path.basename(urlparse(url).path)
filename_png = os.path.splitext(filename)[0] + '.png'  # change file extension to .png
 
opts = webdriver.ChromeOptions()
opts.add_argument('--headless')  # the `headless` attribute is gone in recent Selenium releases
driver = webdriver.Chrome(options=opts)
 
driver.get(url)
 
# Get the width and height of the image
soup = BeautifulSoup(driver.page_source, 'lxml')
width = soup.find('img')['width']
height = soup.find('img')['height']
 
driver.set_window_size(int(width), int(height))  # the attributes are strings, so convert them
driver.save_screenshot(filename_png)

It also works for WebP, Google's image format.

See "Downloading Google's WebP images by taking a screenshot with Selenium WebDriver".

score 0 · Accepted Answer

So, to keep this relevant, here is a 2020 solution using seleniumwire, a package that gives you access to the requests made in the browser. You can use it easily as follows:

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver

# Sometimes, selenium randomly crashed when using seleniumwire, these options fixed that.
# Probably has to do with how it proxies everything.
options = Options()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')

driver = webdriver.Chrome(options=options)
driver.get("https://google.com")

for request in driver.requests:
    # request.path
    # request.method
    # request.headers
    # request.response is the response instance
    # request.response.body is the raw response body in bytes
    print(request.method, request.url)

# if you are using it for a ton of requests, make sure to clear them:
del driver.requests
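
The loop above only names the attributes available on each captured request. A sketch of how they might be combined to pick out image responses follows; the helper name `image_responses` is hypothetical, and it assumes only the `url`, `response`, `headers` and `body` attributes already used elsewhere in this thread.

```python
def image_responses(requests):
    """Yield (url, body) for every captured request whose response is an image."""
    for request in requests:
        response = request.response
        if response is None:
            continue  # the request never received a response
        content_type = response.headers.get('Content-Type') or ''
        if content_type.lower().startswith('image/'):
            yield request.url, response.body
```

Iterating over `image_responses(driver.requests)` would then give the raw bytes of each image, which can be written to a file opened in `'wb'` mode.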

Now, why would you need this? Well, for example to bypass ReCaptcha, or to bypass something like Incapsula. Use it at your own risk.
