I am trying to build a custom index-check utility in Python with Selenium that checks which URLs have been indexed by Google.

I need to get the Google search results so that I can check whether the queried URL appears in them. I am able to get 50 to 60 results before hitting a Google CAPTCHA.

Below is the relevant code:

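from urllib.parse import urlencode

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.FirefoxOptions()
driver = webdriver.Firefox(executable_path=r'./geckodriver', firefox_options=options)

# One URL to check per line.
with open('urls.txt', 'r') as f:
    urls = [line.strip() for line in f]

url_search = "https://www.google.com/search?"

for c, link in enumerate(urls):
    query = {'q': link}
    full_url = url_search + urlencode(query)
    driver.get(full_url)  # load the results page before reading its source
    soup = BeautifulSoup(driver.page_source, 'html.parser')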

I've tried both ChromeDriver and geckodriver, each in headless mode, but got the same result.

My main concern is: how can I use Selenium without getting detected?

I know Google doesn't allow scraping, but there are some paid APIs that do exactly the same thing, i.e. provide Google Search results. How do they work?

I've also searched for Google APIs but can't find one for my use case.

Also, if Google doesn't allow scraping, then why does it let scrapers scrape a limited number of times?

Thanks for your time, I really appreciate it.


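"I know Google doesn't allow scraping, but there are some paid APIs that do exactly the same thing, i.e. provide Google Search results. How do they work?"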




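However, there are a few things you can try. You can set a user agent with Selenium, and also include this in your options: options.add_argument('--disable-blink-features=AutomationControlled'). The latter has worked wonders for some websites when using Selenium with Chrome, but I'm not sure whether it behaves the same with Firefox.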
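A minimal sketch of the Chrome setup described above, assuming Selenium 4 with a chromedriver that Selenium can locate. The helper names `build_chrome_arguments` and `make_driver` are hypothetical, the user agent string is whatever you choose to pass in, and none of this guarantees that Google will stop showing a CAPTCHA.

```python
def build_chrome_arguments(user_agent, headless=False):
    """Return the Chrome command-line switches discussed above."""
    arguments = [
        # Hides the Blink "AutomationControlled" feature from the page.
        "--disable-blink-features=AutomationControlled",
        # Overrides the User-Agent header that Chrome sends.
        f"--user-agent={user_agent}",
    ]
    if headless:
        arguments.append("--headless")
    return arguments


def make_driver(user_agent, headless=False):
    """Create a Chrome driver with those switches applied (hypothetical helper)."""
    # Imported lazily so build_chrome_arguments works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(user_agent, headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `make_driver("<your user agent string>")` would then return a driver you can use in place of the Firefox one in the question.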

As for bypassing Google's CAPTCHA, there is really not much you can do. You can try changing the User-Agent and a few other properties. This post may help you.

As for your last question, Google seems to have a search API that you can use for free (with paid plans too, of course). Here is a blog post about it.
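The answer does not name the API it has in mind. Assuming it is Google's Custom Search JSON API, a rough sketch of a query could look like the following. The function names are hypothetical, and `api_key` and `engine_id` stand for credentials you would have to create yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint of the Custom Search JSON API (assumed to be the API meant above).
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id):
    """Build the request URL; key, cx and q are the API's query parameters."""
    return API_ENDPOINT + "?" + urlencode({"key": api_key, "cx": engine_id, "q": query})


def extract_links(response):
    """Collect result URLs from a decoded JSON response; no 'items' key means no results."""
    return [item["link"] for item in response.get("items", [])]


def search(query, api_key, engine_id):
    """Run one search over the network and return the result URLs."""
    with urlopen(build_search_url(query, api_key, engine_id)) as reply:
        return extract_links(json.load(reply))
```

Because this API searches through a search engine that you configure, its results are not guaranteed to match what google.com shows, so treat it as an approximation for an index check.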

You can use the requests and bs4 libraries instead of selenium, because everything in the Google search results is present in the HTML.


Say you want to scrape the title and URL of each result (code and example in the online IDE):

from bs4 import BeautifulSoup
import requests, lxml

# Faking real user visit.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Search query.
params = {'q': 'ice cream'}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# select() uses CSS selectors. It's like findAll() or find_all(), you can iterate over it.
# if you want to scrape just one element, you can use select_one() method instead.
# The class names below come from Google's markup and may change at any time.
for result in soup.select('.yuRUbf'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('a')['href']
  print(title, link, sep='\n')
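To tie this back to the index check in the question, the scraped links can be compared against the URL being checked. This is only a sketch under my own assumptions: `normalize` and `is_indexed` are hypothetical helpers, and the matching rule (ignore scheme, a leading "www.", any query string, and a trailing slash) is one possible choice, not something the original post specifies.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path, ignoring scheme, 'www.', query and trailing slash."""
    # urlsplit only recognises the host if the URL has a scheme or starts with '//'.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def is_indexed(target_url, result_links):
    """Return True if target_url matches any link collected from the results page."""
    target = normalize(target_url)
    return any(normalize(link) == target for link in result_links)
```

In the loop above you would append each `link` to a list and call `is_indexed(queried_url, links)` once the page has been parsed.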

Alternatively, you can use the Google Search Engine Results API from SerpApi to get these results. It's a paid API with a free trial of 5,000 searches.

Code to integrate, with an example in the online IDE:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# Iterates over JSON output and prints Title, Snippet (summary) and link on the new line
for result in results["organic_results"]:
  print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

Disclaimer: I work for SerpApi.

