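The ideal solution is to have reliable proxies and a CAPTCHA-solving service, because Google Scholar may block your IP, rate-limit it, or otherwise throw a CAPTCHA. Having both services will get you the results you want.
Alternatively, if you don't want to spend time finding reliable services of that kind, you can use the Google Scholar API from SerpApi.
It's a paid API with a free plan that handles all of those problems on its backend, so you don't have to think about them, maintain a solution, or build one from scratch.
Example code to integrate and scrape Google Scholar organic results: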
import os
from urllib.parse import urlsplit, parse_qsl

from serpapi import GoogleSearch


def organic_results():
    print("extracting organic results..")

    params = {
        "api_key": os.getenv("API_KEY"),  # SerpApi key read from the environment
        "engine": "google_scholar",
        "q": "minecraft redstone system structure characteristics strength",  # search query
        "hl": "en",        # language
        "as_ylo": "2017",  # from 2017
        "as_yhi": "2021",  # to 2021
        "start": "0"       # first page
    }

    search = GoogleSearch(params)

    organic_results_data = []
    organic_results_is_present = True

    while organic_results_is_present:
        results = search.get_dict()

        print(f"Currently extracting page #{results['serpapi_pagination']['current']}..")

        for result in results["organic_results"]:
            position = result["position"]
            title = result["title"]
            publication_info_summary = result["publication_info"]["summary"]
            result_id = result["result_id"]
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")

            # "resources" is optional: fall back to None when no file is attached
            try:
                file_title = result["resources"][0]["title"]
            except (KeyError, IndexError):
                file_title = None

            try:
                file_link = result["resources"][0]["link"]
            except (KeyError, IndexError):
                file_link = None

            try:
                file_format = result["resources"][0]["file_format"]
            except (KeyError, IndexError):
                file_format = None

            # "cited_by" and "versions" are optional as well
            inline_links = result.get("inline_links", {})
            cited_by = inline_links.get("cited_by", {})
            versions = inline_links.get("versions", {})

            try:
                cited_by_count = int(cited_by["total"])
            except (KeyError, TypeError, ValueError):
                cited_by_count = None

            cited_by_id = cited_by.get("cites_id")
            cited_by_link = cited_by.get("link")

            try:
                total_versions = int(versions["total"])
            except (KeyError, TypeError, ValueError):
                total_versions = None

            all_versions_link = versions.get("link")
            all_versions_id = versions.get("cluster_id")

            organic_results_data.append({
                "page_number": results["serpapi_pagination"]["current"],
                "position": position + 1,
                "result_type": result_type,
                "title": title,
                "link": link,
                "result_id": result_id,
                "publication_info_summary": publication_info_summary,
                "snippet": snippet,
                "cited_by_count": cited_by_count,
                "cited_by_link": cited_by_link,
                "cited_by_id": cited_by_id,
                "total_versions": total_versions,
                "all_versions_link": all_versions_link,
                "all_versions_id": all_versions_id,
                "file_format": file_format,
                "file_title": file_title,
                "file_link": file_link,
            })

        # if next page is present -> split URL in parts and pass it to GoogleSearch() as a dictionary
        # if no next page -> exit the while loop
        if "next" in results["serpapi_pagination"]:
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            organic_results_is_present = False

    return organic_results_data
# part of the output:
'''
[
  {
    "page_number": 3,
    "position": 7,
    "result_type": "Pdf",
    "title": "A methodology proposal for MMORPG content expansion analysis",
    "link": "https://www.sbgames.org/sbgames2017/papers/ArtesDesignFull/175051.pdf",
    "result_id": "wKa3TpX-gn8J",
    "publication_info_summary": "AMM Santos, A Franco, JGR Maia, F Gomes… - … Simpósio Brasileiro de …, 2017 - sbgames.org",
    "snippet": "… Given the problem statement, this work aims to analyze MMORPGs to identify strengths and … Minecraft players built computers 5, calculators and even printers 6 on top of the “redstone” …",
    "cited_by_count": 4,
    "cited_by_link": "https://scholar.google.com/scholar?cites=9188186107013473984&as_sdt=8000005&sciodt=0,19&hl=en",
    "cited_by_id": "9188186107013473984",
    "total_versions": 2,
    "all_versions_link": "https://scholar.google.com/scholar?cluster=9188186107013473984&hl=en&as_sdt=0,19&as_ylo=2017&as_yhi=2021",
    "all_versions_id": "9188186107013473984",
    "file_format": "PDF",
    "file_title": "sbgames.org",
    "file_link": "https://www.sbgames.org/sbgames2017/papers/ArtesDesignFull/175051.pdf"
  }, ... other results
]
'''
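The pagination step in the code above works by turning the `next` URL from `serpapi_pagination` into a dictionary of query parameters and merging it into the current search parameters. Here is a minimal standalone sketch of just that step, using only the standard library. The helper name `next_page_params` and the example URL are assumptions made up for illustration; they are not real API output.

```python
from urllib.parse import urlsplit, parse_qsl


def next_page_params(next_url):
    """Return the query-string parameters of a URL as a plain dict."""
    return dict(parse_qsl(urlsplit(next_url).query))


# Hypothetical "next" URL shaped like a paginated search link (not real output).
example_next = "https://example.com/search?engine=google_scholar&q=redstone&start=10&hl=en"

# Parameters of the current page; "start" is the only value that should change.
params = {"engine": "google_scholar", "q": "redstone", "start": "0", "hl": "en"}
params.update(next_page_params(example_next))

print(params["start"])  # -> 10
```

Since the merged dictionary overwrites only the keys present in the `next` URL, the remaining search parameters carry over to the following request unchanged.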
Additionally, if you want to scrape profile and author results, there is a dedicated blog post on scraping all Google Scholar profile and author results to CSV with Python and SerpApi.
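For the organic results extracted earlier, saving the list of dictionaries to CSV could be sketched with the standard library's `csv.DictWriter`. This is only an illustration and is not taken from the blog post: the `to_csv` helper name and the sample rows are made up, and the rows only imitate the shape of the dictionaries built in the code above.

```python
import csv
import io


def to_csv(rows):
    """Serialize a list of flat dicts to CSV text; columns come from the first row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows that imitate a few of the extracted fields.
sample = [
    {"position": 1, "title": "Example paper A", "cited_by_count": 4},
    {"position": 2, "title": "Example paper B", "cited_by_count": None},
]

csv_text = to_csv(sample)
print(csv_text)
```

Missing values stored as `None` end up as empty cells, so rows without a citation count still line up with the header.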
Disclaimer: I work for SerpApi.