You may have sent too many requests, or Google detected your script as automated.
The first thing you can try is adding proxies to your request:
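# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    # Or type your proxy URLs here directly instead of using os.getenv()
    'http': os.getenv('HTTP_PROXY'),
    'https': os.getenv('HTTPS_PROXY'),  # used for https:// URLs such as Google Scholar
}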
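As a minimal sketch of how such a dictionary could be built and passed to a request: the helper names `build_proxies` and `fetch` are hypothetical, and the `HTTPS_PROXY` variable is an assumption added here because the Scholar URL uses https. Unset variables are skipped so that `requests` never receives a `None` proxy.

```python
import os


def build_proxies(env=os.environ):
    """Collect proxy URLs from the environment, skipping unset variables."""
    proxies = {}
    for scheme, var in (("http", "HTTP_PROXY"), ("https", "HTTPS_PROXY")):
        value = env.get(var)
        if value:
            proxies[scheme] = value
    return proxies


def fetch(url, timeout=10):
    """Send a GET request through whatever proxies are configured."""
    # requests is third-party; it is imported here so build_proxies() works without it.
    import requests
    return requests.get(url, proxies=build_proxies(), timeout=timeout)
```

Calling `fetch('https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security')` would then route the request through the configured proxy, if any. A proxy does not guarantee that Google will not show a CAPTCHA.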
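Alternatively, you can make it work with either requests-html or selenium, which render the whole HTML page without using a proxy, although you may still get a CAPTCHA.
Code that works (I tested it locally):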
# If you get an empty list, Google served you a CAPTCHA.
# Print the response to see what caused it.
# Note: the code below doesn't handle pagination. https://requests-html.kennethreitz.org/#pagination
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security'
response = session.get(url)
# https://requests-html.kennethreitz.org/#requests_html.HTML.render
response.html.render(sleep=1)
for author_name in response.html.find('.gs_ai_name'):
    name = author_name.text
    print(name)
Output:
Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia
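When the list of names comes back empty, a small check can help tell a block page apart from a page that simply has no results. The sketch below is a heuristic only: the function name `diagnose` is hypothetical, and the marker strings it searches for are assumptions about what a block page contains, not something Google documents.

```python
def diagnose(html_text, names):
    """Return a short status string for a scraped Scholar results page."""
    if names:
        return "ok"
    # Assumption: a block page mentions one of these phrases somewhere in its HTML.
    lowered = html_text.lower()
    if "captcha" in lowered or "unusual traffic" in lowered:
        return "captcha"
    return "empty"


# Possible usage with requests-html, after response.html.render(sleep=1):
#   names = [el.text for el in response.html.find('.gs_ai_name')]
#   print(diagnose(response.html.html, names))
```

Printing the status before parsing makes it easier to decide whether to retry later, switch proxies, or inspect the selector.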
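Alternatively, you can use the Google Scholar Profiles API from SerpApi. It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.
The main difference is that you don't have to think about solving CAPTCHAs, or put up with a slow scraping process caused by rendering pages or by the load that multiple browser instances put on your PC, as happens with selenium.
Code to integrate: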
from serpapi import GoogleSearch
params = {
    "engine": "google_scholar_profiles",
    "hl": "en",
    "mauthors": "label:security",
    "api_key": "YOUR_API_KEY"
}
search = GoogleSearch(params)
results = search.get_dict()
for author_name in results['profiles']:
    name = author_name['name']
    print(name)
Output:
Johnson Thomas
Martin Abadi
Adrian Perrig
Vern Paxson
Frans Kaashoek
Mihir Bellare
Matei Zaharia
Helen J. Wang
Zhu Han
Sushil Jajodia
Part of the JSON output:
"profiles": [
  {
    "name": "Johnson Thomas",
    "link": "https://scholar.google.com/citations?hl=en&user=eKLr0EgAAAAJ",
    "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=en",
    "author_id": "eKLr0EgAAAAJ",
    "affiliations": "Professor of Computer Science, Oklahoma State University",
    "email": "Verified email at cs.okstate.edu",
    "cited_by": 150263,
    "interests": [
      {
        "title": "Security",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:security"
      }
    ]
  }
]
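The other fields in that JSON can be pulled out the same way as `name`. Here is a minimal sketch, assuming the response has the shape shown above; the helper name `summarize_profiles` is hypothetical, and `.get()` is used because a given profile may not carry every field.

```python
def summarize_profiles(results):
    """Flatten the 'profiles' list into simple dicts with the main fields."""
    rows = []
    for profile in results.get("profiles", []):
        # Keep only the interest titles; skip entries without one.
        interests = [i["title"] for i in profile.get("interests", []) if "title" in i]
        rows.append({
            "name": profile.get("name"),
            "affiliations": profile.get("affiliations"),
            "cited_by": profile.get("cited_by"),
            "interests": interests,
        })
    return rows


# Sample data copied from the JSON excerpt above (trimmed to the fields used here).
sample = {
    "profiles": [
        {
            "name": "Johnson Thomas",
            "author_id": "eKLr0EgAAAAJ",
            "affiliations": "Professor of Computer Science, Oklahoma State University",
            "cited_by": 150263,
            "interests": [{"title": "Security"}],
        }
    ]
}

for row in summarize_profiles(sample):
    print(row["name"], row["cited_by"], row["interests"])
```

With real data, `results = search.get_dict()` would take the place of `sample`.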
Disclaimer: I work for SerpApi.