I am trying to build a custom index-check utility in Python with Selenium to check which URLs have been indexed by Google.
I need to fetch the Google search results so that I can check whether the queried URL appears in them or not. I am able to get 50 to 60 results before Google shows a CAPTCHA.
Here is the relevant code:
    from urllib.parse import urlencode

    from bs4 import BeautifulSoup
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')  # set_headless() is deprecated in recent Selenium releases
    driver = webdriver.Firefox(executable_path=r'./geckodriver', options=options)  # firefox_options= was renamed to options=

    urls = [line.strip() for line in open('urls.txt', 'r')]
    url_search = "https://www.google.com/search?"

    for c, link in enumerate(urls):
        query = {'q': link}
        full_url = url_search + urlencode(query)
        driver.get(full_url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
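For the "check whether the queried URL exists in the results" step, this is roughly what I do afterwards (a stdlib-only sketch; the comparison by domain rather than exact URL is my own assumption, and Google's result markup changes often, so matching on any outbound `href` is deliberately loose):

    from html.parser import HTMLParser
    from urllib.parse import urlparse

    class LinkExtractor(HTMLParser):
        """Collect absolute links (href starting with http) from a results page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value and value.startswith('http'):
                        self.links.append(value)

    def is_indexed(page_source, target_url):
        """Return True if the target URL's domain appears in any result link."""
        parser = LinkExtractor()
        parser.feed(page_source)
        target = urlparse(target_url).netloc or target_url
        return any(target in href for href in parser.links)
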
I've tried both ChromeDriver and geckodriver in headless mode and got the same result.
My main concern is: how can I use Selenium without being detected?
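For context, the only mitigations I know of are slowing down and varying the requests. Here is a minimal sketch of what I mean (the delay values and user-agent strings are arbitrary examples of mine, and I assume this only reduces, not eliminates, how often the CAPTCHA appears):

    import random
    import time

    # Assumed sample desktop user-agent strings; any realistic set would do.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0",
        "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
    ]

    def pick_user_agent():
        """Choose a user agent at random for the next browser session."""
        return random.choice(USER_AGENTS)

    def polite_delay(base=5.0, jitter=10.0):
        """Sleep a randomized interval between queries and return the delay used."""
        delay = base + random.uniform(0.0, jitter)
        time.sleep(delay)
        return delay

The user agent can then be set on Firefox via `options.set_preference("general.useragent.override", pick_user_agent())`, and `polite_delay()` called once per loop iteration.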
I know Google doesn't allow scraping, but there are paid APIs that do exactly the same thing, i.e. provide Google search results. How do they work?
I've also searched Google's own APIs but couldn't find one for my use case.
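The closest official thing I found is the Custom Search JSON API (Programmable Search Engine), which can be configured to search the whole web, though I gather its results may differ from plain google.com searches. A minimal sketch of how a query would look (the key and engine ID here are placeholders):

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def build_cse_url(api_key, cx, query):
        """Build a Custom Search JSON API request URL."""
        return "https://www.googleapis.com/customsearch/v1?" + urlencode(
            {"key": api_key, "cx": cx, "q": query})

    def search(api_key, cx, query):
        """Run one query and return the parsed JSON response."""
        with urlopen(build_cse_url(api_key, cx, query)) as resp:
            return json.load(resp)

This has quota limits and needs a search engine ID (`cx`), which is why I'm not sure it fits my use case.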
Also, if Google doesn't allow scraping, why does it let scrapers run a limited number of times before blocking them?
Thanks for your time, I really appreciate it.