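Make sure you're sending a user-agent header, because the default user-agent of the requests library is python-requests. Google may block such requests, in which case you receive different HTML containing some kind of error page that doesn't include the selectors you're trying to select. Check what your user-agent is.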
Rotating user-agents between requests may also be a good idea.
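As a minimal sketch of user-agent rotation: the pool below and the random_headers helper are hypothetical names for illustration, and the extra user-agent strings are assumptions, so swap in strings you have verified yourself.

```python
import random

# Hypothetical pool of user-agent strings (assumption: replace with your own verified ones)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
]


def random_headers():
    """Return request headers with a user-agent picked at random from the pool."""
    return {"User-Agent": random.choice(USER_AGENTS)}


# Build fresh headers for every request so consecutive requests can differ
headers = random_headers()
print(headers["User-Agent"])
```

You would then call random_headers() once per request and pass the result as the headers argument of requests.get.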
Code and a full example in the online IDE that scrapes more:
from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; BeautifulSoup needs it installed for the 'lxml' parser
headers = {
    'User-Agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
queries = [
    'Role for migratory wild birds in the global spread of avian influenza H5N8',
    'Uncoupling conformational states from activity in an allosteric enzyme',
    'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
    'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
    'Primary Prevention of CVD',
    'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
    'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
    'We Know Who Likes Us, but Not Who Competes Against Us',
]
for query in queries:
    params = {
        "q": query,
        "hl": "en",
    }

    # The original call also passed proxies=proxies; add it back only if you define a proxies dict
    html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params, timeout=30).text
    soup = BeautifulSoup(html, 'lxml')

    # Container where all needed data is located
    for result in soup.select('.gs_ri'):
        title = result.select_one('.gs_rt').text
        title_link = result.select_one('.gs_rt a')['href']
        # The "Cited by" link may be missing, so guard against None
        cited_by_link = result.select_one('#gs_res_ccl_mid .gs_nph+ a')
        cited_by = cited_by_link['href'] if cited_by_link else None
        cited_by_count = cited_by_link.text.split(' ')[2] if cited_by_link else None
        print(f"{title}\n{title_link}\n{cited_by}\n{cited_by_count}\n")
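The split(' ')[2] call above relies on the link text being exactly "Cited by N". As a hedged alternative, here is a small sketch that pulls the number out with a regular expression instead. parse_cited_by_count is a hypothetical helper name, and it assumes the English label you get with "hl": "en".

```python
import re


def parse_cited_by_count(text):
    """Return the integer from a 'Cited by N' label, or None if the label is absent."""
    # Assumption: the page is in English ("hl": "en"), so the label reads "Cited by N"
    match = re.search(r"Cited by (\d+)", text or "")
    return int(match.group(1)) if match else None


print(parse_cited_by_count("Cited by 123"))
print(parse_cited_by_count("Related articles"))
```

This returns None instead of raising when the matched link turns out to be something other than the citation count.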
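Alternatively, you can use the Google Scholar Organic Results API from SerpApi to achieve the same thing. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want, rather than figuring out why certain things don't work as they should.
Code to integrate: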
from serpapi import GoogleSearch
import json
import os
queries = [
    'Role for migratory wild birds in the global spread of avian influenza H5N8',
    'Uncoupling conformational states from activity in an allosteric enzyme',
    'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
    'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
    'Primary Prevention of CVD',
    'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
    'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
    'We Know Who Likes Us, but Not Who Competes Against Us',
]
for query in queries:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": query,
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    data = []

    for result in results['organic_results']:
        # Some fields may be absent for a given result, so read them with .get()
        inline_links = result.get('inline_links', {})
        data.append({
            'title': result['title'],
            'link': result.get('link'),
            'publication_info': result['publication_info']['summary'],
            'snippet': result.get('snippet'),
            'cited_by': inline_links.get('cited_by', {}).get('link'),
            'related_versions': inline_links.get('related_pages_link'),
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))
P.S. I wrote a blog post, with visual illustrations, about how to scrape pretty much everything on Google Scholar.
Disclaimer: I work for SerpApi.