You can use requests, which decodes the response for you automatically.
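Note: the after_author URL parameter is a next-page token. When you request the exact URL you provided, the returned HTML will differ from what you expect, because after_author changes with every request. For example, in my case it was uB8AAEFN__8J, while in your URL it is rukAAOJ8__8J.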
To make it work, you need to parse the next-page token from the first page, which leads to the second page, and so on. For example:
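# from my other answer:
# https://github.com/dimitryzub/stackoverflow-answers-archive/blob/main/answers/scrape_all_scholar_profiles_bs4.py
import re
import requests
from bs4 import BeautifulSoup

params = {
    "view_op": "search_authors",
    "mauthors": "valve",
    "hl": "pl",
    "astart": 0
}
# the request setup was omitted in the original snippet; URL and User-Agent are assumed from the example further down
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"}

authors_is_present = True
while authors_is_present:
    # fetch and parse the current results page
    html = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "html.parser")

    # if next page is present -> update next page token and increment to the next page
    # if next page is not present -> exit the while loop
    next_button = soup.select_one("button.gs_btnPR")
    # .get() avoids a KeyError when the button has no onclick attribute (last page)
    token = re.search(r"after_author\\x3d(.*)\\x26", str(next_button.get("onclick"))) if next_button else None
    if token:
        params["after_author"] = token.group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False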
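The token handling in the loop above can also be isolated into small functions that are easy to check without any network access. This is only a sketch: the names extract_after_author and next_page_params are hypothetical, the sample onclick string is an assumed example of what the button attribute looks like, and the regex uses a non-greedy group instead of the greedy one above so that it stops at the first \x26.

```python
import re

# Matches the literal text "after_author\x3d<token>\x26" inside the onclick string.
TOKEN_RE = re.compile(r"after_author\\x3d(.*?)\\x26")


def extract_after_author(onclick):
    """Return the next-page token from an onclick string, or None if absent."""
    if not onclick:
        return None
    match = TOKEN_RE.search(onclick)
    return match.group(1) if match else None


def next_page_params(params, onclick, page_size=10):
    """Return a copy of params advanced by one page, or None on the last page."""
    token = extract_after_author(onclick)
    if token is None:
        return None
    updated = dict(params)
    updated["after_author"] = token
    updated["astart"] = updated.get("astart", 0) + page_size
    return updated


# Assumed shape of the attribute value; the real markup may differ.
sample_onclick = (
    r"window.location='/citations?view_op\x3dsearch_authors\x26hl\x3dpl"
    r"\x26mauthors\x3dvalve\x26after_author\x3dXB0HAMS9__8J\x26astart\x3d10'"
)

print(next_page_params({"astart": 0}, sample_onclick))
```

Returning None when there is no token gives the calling loop a single condition to stop on. Note that this sketch assumes the token is always followed by another \x26-separated parameter, as in the sample string.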
Code and example in the online IDE to extract profile data:
from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "label:security",
    "hl": "pl",
    "view_op": "search_authors"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))
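One small caveat with the f-string above: parsel's .get() returns None when the selector matches nothing, so a missing href would produce a link ending in "None". A minimal sketch of a safer way to build the link, assuming a hypothetical helper named build_profile_link:

```python
from urllib.parse import urljoin

BASE_URL = "https://scholar.google.com"


def build_profile_link(href):
    """Return an absolute profile URL, or None when the href is missing."""
    if not href:
        return None
    # urljoin resolves the site-relative href against the base URL
    return urljoin(BASE_URL, href)


print(build_profile_link("/citations?hl=pl&user=eKLr0EgAAAAJ"))
```

In the loop, this would be called with the result of profile.css(".gs_ai_name a::attr(href)").get().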
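Alternatively, you can achieve the same thing with the Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.
The difference is that you don't need to figure out how to extract the data, bypass blocks from search engines, scale the number of requests, and so on.
Example code to integrate: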
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),      # SerpApi API key
    "engine": "google_scholar_profiles",  # SerpApi profiles parsing engine
    "hl": "pl",                           # language
    "mauthors": "label:security"          # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))

# part of the output:
'''
{
  "name": "Johnson Thomas",
  "link": "https://scholar.google.com/citations?hl=pl&user=eKLr0EgAAAAJ",
  "serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=pl",
  "author_id": "eKLr0EgAAAAJ",
  "affiliations": "Professor of Computer Science, Oklahoma State University",
  "email": "Zweryfikowany adres z cs.okstate.edu",
  "cited_by": 159999,
  "interests": [
    {
      "title": "Security",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Asecurity",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:security"
    },
    {
      "title": "cloud computing",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Acloud_computing",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:cloud_computing"
    },
    {
      "title": "big data",
      "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Abig_data",
      "link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:big_data"
    }
  ],
  "thumbnail": "https://scholar.google.com/citations/images/avatar_scholar_56.png"
}
'''
Disclaimer: I work for SerpApi.