
I am trying to extract and download all the images from a URL. I wrote a script:

import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName,'wb')
        output.write(imgData)
        output.close()
    except:
        pass

I don't want to extract only the images on this first page (see this screenshot: http://i.share.pho.to/1c9884b1_l.jpeg); I want to get all the images without clicking the "Next" button. I can't figure out how to get all the pics that sit behind the "Next" class. What changes should I make in findall? A rough sketch of the loop I have in mind is below.
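To make the goal concrete, here is roughly the pagination loop I am imagining — purely a guess, since I have not confirmed the gallery actually serves .../page/N/ URLs:

import re
import urllib2

base = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
imgUrls = []

for page in range(1, 6):  # hypothetical upper bound on the page count
    # guessed URL scheme; the real "Next" link target needs checking
    pageUrl = base if page == 1 else "%spage/%d/" % (base, page)
    try:
        content = urllib2.urlopen(pageUrl).read()
    except urllib2.HTTPError:
        break  # no such page, so we have run out of gallery pages
    imgUrls.extend(re.findall('img .*?src="(.*?)"', content))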


3 Answers


The following should extract all images from a given page and write them to the directory the script is run from.

import re
import requests
from bs4 import BeautifulSoup

site = 'http://pixabay.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags if img.get('src')]  # skip <img> tags that have no src


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
         print("Regex didn't match with the url: {}".format(url))
         continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # an image source can be relative;
            # if it is, prepend the base URL,
            # which here is simply the site variable
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
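One caveat: the 'http' not in url test misfires on protocol-relative sources like //cdn.example.com/pic.jpg. urllib.parse.urljoin handles absolute, relative, and protocol-relative URLs alike; a minimal variant of the download loop, reusing the site and urls variables from above:

from urllib.parse import urljoin

for url in urls:
    # urljoin leaves absolute URLs untouched and resolves relative
    # and protocol-relative ones against the base site
    full_url = urljoin(site, url)
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', full_url)
    if not filename:
        continue
    response = requests.get(full_url)
    with open(filename.group(1), 'wb') as f:
        f.write(response.content)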
answered 2017-09-10T17:04:38.617

A slight modification of Jonathan's answer (posted because I can't comment yet): adding "www" to the site URL fixes most "File type not supported" errors.

import re
import requests
from bs4 import BeautifulSoup

site = 'http://www.google.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags if img.get('src')]  # skip <img> tags that have no src


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
         print("Regex didn't match with the url: {}".format(url))
         continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # an image source can be relative;
            # if it is, prepend the base URL,
            # which here is simply the site variable
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
answered 2021-03-21T18:42:37.270

If you only want the pictures, you can download them directly without even scraping the webpage — they all share the same URL pattern:

http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg
...
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute10.jpg

So this simple code will get you all the images:

import os
import urllib


baseUrl = "http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"\
      "cutest-pics-gallery/cute%s.jpg"

for i in range(1,11):
    url = baseUrl % i  # cute1.jpg through cute10.jpg
    urllib.urlretrieve(url, os.path.basename(url))

With BeautifulSoup you would have to click through or follow the "Next" link to scrape each page. If you do want to scrape every page individually, try scraping them via the class that is on that page: shutterset_katrina-kaifs-top-10-cutest-pics-gallery — a sketch follows.
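A sketch of that idea — it assumes the full-size images are linked from <a> tags carrying that class, which would need to be checked against the actual page markup:

import os
import urllib
import urllib2
from bs4 import BeautifulSoup

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')

# hypothetical markup: <a class="shutterset_..." href="...cute1.jpg">
for a in soup.find_all('a', class_='shutterset_katrina-kaifs-top-10-cutest-pics-gallery'):
    href = a.get('href')
    if href and href.endswith('.jpg'):
        urllib.urlretrieve(href, os.path.basename(href))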

answered 2013-08-24T08:38:31.900