
I am trying to build a custom index-check utility with Python and Selenium to find out which of my URLs have been indexed by Google.

I need to fetch the Google search results so that I can check whether the queried URL exists in them or not. I can get through 50 to 60 queries before Google serves a CAPTCHA.

Below is the relevant code:

from urllib.parse import urlencode

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True  # set_headless() is deprecated in newer Selenium

driver = webdriver.Firefox(executable_path=r'./geckodriver', options=options)

# One URL to check per line in urls.txt.
urls = [line.strip() for line in open('urls.txt', 'r')]

url_search = "https://www.google.com/search?"

for c, link in enumerate(urls):

    # Query Google for the URL itself.
    query = {'q': link}
    full_url = url_search + urlencode(query)

    driver.get(full_url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
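    # Sketch of the check I intend to add next (hypothetical, assuming the
    # result links show up as plain anchor hrefs in the parsed page):
    hrefs = [a.get('href', '') for a in soup.find_all('a')]
    if any(link in href for href in hrefs):
        print(f'{link} appears to be indexed')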

I've tried both ChromeDriver and geckodriver in headless mode but got the same result.

My main question is: how can I use Selenium without getting detected?

I know Google doesn't allow scraping, but there are paid APIs that do exactly this, i.e. provide Google search results. How do they work?!

I've also searched for Google APIs but can't find one for my use case.

Also, if Google doesn't allow scraping, why does it let scrapers run a limited number of times before blocking them?

Thanks for your time, I really appreciate it.


3 Answers


If a website doesn't want you to scrape it, there is usually not much you can do, especially with sites like Google or Amazon. Really, it's also a question of whether you should be doing it at all.

"I know Google doesn't allow scraping, but there are paid APIs that do exactly this, i.e. provide Google search results. How do they work?!"

They use tooling similar to what you are using, just at a much larger scale. One example is multiple scraping agents running in containers, each with a different proxy, until they are detected. The agents then combine their findings and restart to keep scraping.
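
A rough sketch of that rotation idea, assuming you have a pool of working proxies (the addresses below are placeholders):

import requests
from itertools import cycle

# Placeholder proxy pool; a real service would maintain hundreds of these
# and replace them as they get blocked.
proxies = cycle([
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
])

def fetch(url):
    # Rotate to the next proxy on every attempt; if one fails (e.g. it
    # has been blocked), fall through to the next.
    for _ in range(3):
        proxy = next(proxies)
        try:
            return requests.get(url,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=10)
        except requests.RequestException:
            continue
    raise RuntimeError('all proxies failed')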

"Also, if Google doesn't allow scraping, why does it let scrapers run a limited number of times before blocking them?"

That probably happens because it can take some time to determine that a bot is being used. Also, you may get to scrape for a while before they decide you are abusing their service.

However, there are a few things you can try. You can use a user agent with Selenium, and include this in your options: options.add_argument('--disable-blink-features=AutomationControlled'). The latter has worked wonders on some sites with Chrome and Selenium, but I'm not sure whether the same holds for Firefox.
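
A short sketch combining both suggestions on Chrome (the user-agent string is only an example):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Suppress the automation hints Blink normally exposes (e.g. navigator.webdriver).
options.add_argument('--disable-blink-features=AutomationControlled')
# Spoof a regular desktop browser user agent (example string).
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                     'Chrome/89.0.4389.114 Safari/537.36')

driver = webdriver.Chrome(options=options)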

Answered 2021-04-10T09:16:20.623

There is really not much you can do to bypass Google's CAPTCHA. You can try changing the User-Agent and a few other properties. This article may help you.
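
Since the question uses Firefox, one way to change the User-Agent there is through a profile preference; a minimal sketch (the user-agent string is only an example):

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True
# Override the user agent Firefox reports (example string).
options.set_preference('general.useragent.override',
                       'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                       'Gecko/20100101 Firefox/87.0')

driver = webdriver.Firefox(options=options)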

As for your last question, Google appears to have a search API that you can use for free (with paid plans as well, of course). Here is a blog post about it.
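
That presumably refers to the Custom Search JSON API; a minimal sketch, assuming you have created an API key and a Programmable Search Engine ID (both placeholders below):

import requests

API_KEY = 'YOUR_API_KEY'         # from the Google Cloud console (placeholder)
CX = 'YOUR_SEARCH_ENGINE_ID'     # from the Programmable Search console (placeholder)

response = requests.get('https://www.googleapis.com/customsearch/v1',
                        params={'key': API_KEY, 'cx': CX, 'q': 'ice cream'})

for item in response.json().get('items', []):
    print(item['title'], item['link'])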

Answered 2021-04-10T09:05:14.643

You can use the requests and bs4 libraries instead of selenium, since everything you need from the Google search results lives in the HTML.

Make sure you pass a user-agent to fake a real user visit, because the default user-agent in the requests library is python-requests, and we need to avoid that.

Let's say you want to scrape the title and URL of each result:

from bs4 import BeautifulSoup
import requests, lxml

# Faking a real user visit; the default requests user-agent is python-requests.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Search query.
params = {'q': 'ice cream'}

html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

# select() uses CSS selectors. It's like findAll() or find_all(); you can iterate over it.
# If you want to scrape just one element, use the select_one() method instead.
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('a')['href']
    print(f'{title}\n{link}\n')

Alternatively, you can use the Google Search Engine Results API from SerpApi to get these results. It's a paid API with a free trial of 5,000 searches.

Code to integrate, with an example:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# Iterate over the JSON output and print title, snippet (summary) and link on new lines.
for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

Disclaimer: I work for SerpApi.

Answered 2021-05-13T08:00:35.600