
I'm having trouble parsing Google image search results. I tried selenium webdriver: it returned 100 results, but it was slow. So I decided to request the page with the requests module instead, but it only returned 20 results. How can I get the same 100 results? Is there a way to paginate?
Here is the selenium code:

import re

_url = r'imgurl=([^&]+)&'

for search_url in lines:
    driver.get(normalize_search_url(search_url))

    images = driver.find_elements(By.XPATH, "//div[@class='rg_di']")
    print("{0} results for {1}".format(len(images), ' '.join(driver.title.split(' ')[:-3])))
    with open('urls/{0}.txt'.format(search_url.strip().replace('\t', '_')), 'a') as f:
        for image in images:
            url = image.find_element(By.TAG_NAME, "a")
            u = re.findall(_url, url.get_attribute("href"))
            for item in u:
                f.write(item)
                f.write('\n')

Here is the requests code:

import re
import requests
from bs4 import BeautifulSoup

_url = r'imgurl=([^&]+)&'

for search_url in lines[:10]:
    print(normalize_search_url(search_url))
    links = 0
    request = requests.get(normalize_search_url(search_url))
    soup = BeautifulSoup(request.text, 'html.parser')
    file = 'cars2/{0}.txt'.format(search_url.strip().replace(' ', '_'))
    with open(file, 'a') as f:
        for image in soup.find_all('a', href=True):
            if 'imgurl' in image.get('href'):
                links += 1
            u = re.findall(_url, image.get("href"))
            for item in u:
                f.write(item)
                f.write('\n')
                print(item)
        print("{0} links extracted for {1}".format(links, soup.title.string))

2 Answers


I've never tried selenium, but have you tried Google's search API? It might work for you: https://developers.google.com/products/#google-search

Also, their API limit is 100 requests per day, so I don't think you'll get past 100.
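If you go that route today, the relevant offering is the Custom Search JSON API, which returns at most 10 results per request and pages with the `start` parameter. A minimal sketch, assuming you have your own API key and Programmable Search Engine ID (the `API_KEY` and `CX` placeholders below are mine):

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: your Google API key
CX = "YOUR_CSE_ID"        # placeholder: your Programmable Search Engine ID

def image_search_params(query, start):
    # The API returns at most 10 results per request; paginate with
    # start=1, 11, 21, ..., 91 for up to 100 results in total.
    return {
        "key": API_KEY,
        "cx": CX,
        "q": query,
        "searchType": "image",
        "num": 10,
        "start": start,
    }

def fetch_image_links(query):
    links = []
    for start in range(1, 100, 10):  # 10 pages x 10 results = 100
        resp = requests.get("https://www.googleapis.com/customsearch/v1",
                            params=image_search_params(query, start))
        resp.raise_for_status()
        links += [item["link"] for item in resp.json().get("items", [])]
    return links
```

The daily free quota applies per request, so fetching 100 images for one query costs 10 of your 100 daily requests.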

Answered 2014-06-30T19:51:07.253

You can scrape Google Images using the beautifulsoup and requests libraries; selenium is not required.

To get batches of 100 images, you can use the "ijn" query parameter: "ijn=0" returns the first 100 images, "ijn=1" the next 100 (images 100-200), and so on.
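Pagination over these batches can be sketched as a simple loop over ijn values (the query and batch count below are arbitrary):

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def page_params(query, batch):
    # ijn=0 -> first batch of ~100 images, ijn=1 -> second batch, etc.
    return {"q": query, "tbm": "isch", "hl": "en", "ijn": str(batch)}

def fetch_batches(query, batches=2):
    # fetch the raw HTML for each batch; parsing happens downstream
    pages = []
    for batch in range(batches):
        resp = requests.get("https://www.google.com/search",
                            params=page_params(query, batch),
                            headers=headers, timeout=30)
        pages.append(resp.text)
    return pages
```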

To scrape the full-resolution image URLs with requests and beautifulsoup, you need to extract the data from the page source via regex.

Find all <script> tags:

soup.select('script')

Match the image data inside the <script> tags via regex:

matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
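On a small made-up snippet, that regex captures everything between the call's opening parenthesis and the closing `);`:

```python
import re

# hypothetical, simplified version of what appears in the page source
sample = '<script>AF_initDataCallback({key: "ds:1", data:[["GRID_STATE0"]]});</script>'

matches = re.findall(r"AF_initDataCallback\(([^<]+)\);", sample)
# matches[0] is the raw argument: '{key: "ds:1", data:[["GRID_STATE0"]]}'
```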

Match the desired images (full-resolution size) via regex:

# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                    matched_images_data_json)

Extract and decode them using bytes() and decode():

for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
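The decode is applied twice because the URLs come out of the page source double-escaped (e.g. a backslash-escaped `\u003d` where `=` should be). A small self-contained illustration with a made-up URL:

```python
# As matched out of the page source: "=" appears as "\u003d" with its
# backslash itself escaped, so one unicode-escape pass is not enough.
raw = r"https://example.com/photo.jpeg?cs\\u003dsrgb"  # made-up URL

once = bytes(raw, 'ascii').decode('unicode-escape')    # still contains "\u003d"
twice = bytes(once, 'ascii').decode('unicode-escape')  # now a clean URL

print(twice)  # https://example.com/photo.jpeg?cs=srgb
```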

Code, and a full example in the online IDE that also downloads the images:

import requests, lxml, re, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch", 
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')


def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)


    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)



get_images_data()


-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...

Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...

Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''

Alternatively, you can achieve the same thing with the Google Images API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to deal with regex, bypass Google's blocks, or maintain the parser over time when it breaks. Instead, you just iterate over structured JSON and grab the data you want.

Code to integrate:

import os, json # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "pexels cat",
      "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))


get_google_images()

---------------
'''
[
... # other images 
  {
    "position": 100, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
    "source": "pexels.com",
    "title": "Close-up of Cat · Free Stock Photo",
    "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
    "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
]
'''

P.S. - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce the chance of being blocked while scraping search engines.

Disclaimer, I work for SerpApi.

Answered 2021-10-21T11:01:29.250