python - Extracting Google Scholar results using Python (or R)

Question

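I'd like to use Python to scrape Google Scholar search results. I found two different scripts to do that: one is gscholar.py and the other is scholar.py (can that one be used as a Python library?).

Now, I should probably say that I'm totally new to Python, so sorry if I'm missing the obvious!

The problem is that when I use gscholar.py as explained in the README file, I get as a result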

query() takes at least 2 arguments (1 given).

Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get

query() takes at least 2 arguments (2 given).

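This confuses me. I also tried specifying the third possible argument (outformat=4, which is the BibTeX format), but that gives me a list of function errors. A colleague advised me to import BeautifulSoup and this before running the query, but that doesn't change the problem either. Any suggestions on how to solve this?

I found code for R (see link) as a solution, but I quickly got blocked by Google. Maybe someone could suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!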

score 14 · Accepted Answer

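I suggest you not use site-specific libraries for crawling specific websites, but rather a general-purpose HTML library that is well tested and well documented, such as BeautifulSoup.

To access websites while presenting browser information, you can use a URL opener class with a custom user agent: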

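# Python 2 import; in Python 3 the class lives in urllib.request and is deprecated.
from urllib import FancyURLopener
class MyOpener(FancyURLopener):
    # The opener sends this string as its User-Agent header.
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
# Bind open() so that pages can be fetched with openurl(url).
openurl = MyOpener().open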

Then download the required URL as follows:

openurl(url).read()

To retrieve Scholar results, just use the URL http://scholar.google.se/scholar?hl=en&q=${query}.
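For readers on Python 3, where FancyURLopener is deprecated, the same idea can be sketched with urllib.request.Request. This is a minimal sketch, not the answer's original code: the helper names scholar_url, build_request and fetch are made up for illustration, and the query is percent-encoded with urlencode rather than pasted into the ${query} placeholder by hand.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: any current browser User-Agent string can be substituted here.
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/33.0.1750.152 Safari/537.36")


def scholar_url(query, lang="en"):
    """Build the search URL, percent-encoding the query string."""
    return "http://scholar.google.se/scholar?" + urlencode({"hl": lang, "q": query})


def build_request(url):
    """Attach the browser-like User-Agent header to a request object."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url):
    """Download the page body as bytes (performs a real network request)."""
    with urlopen(build_request(url)) as response:
        return response.read()
```

Calling fetch(scholar_url("my query")) would then play the role of openurl(url).read(). Whether Google accepts the request still depends on factors beyond the User-Agent header.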

要从检索到的 HTML 文件中提取信息，您可以使用以下代码：

from bs4 import SoupStrainer, BeautifulSoup
page = BeautifulSoup(openurl(url).read(), parse_only=SoupStrainer('div', id='gs_ab_md'))

This code extracts a specific div element that contains the number of results shown on a Google Scholar search results page.
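Once the text of that div is available (for example via page.get_text()), the count itself still has to be pulled out of the sentence. The sketch below assumes the text looks like "About 1,230,000 results (0.05 sec)"; that format is an assumption about the page, not something the answer specifies, and parse_result_count is a hypothetical helper name.

```python
import re


def parse_result_count(summary_text):
    """Return the result count as an int, or None if no count is found.

    Assumes the count appears as digits with optional comma separators,
    followed by the word "result" or "results".
    """
    match = re.search(r"(\d[\d,]*)\s+results?", summary_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

Returning None rather than raising keeps the caller in control when Google serves a CAPTCHA page or changes the wording.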

score 8 · Accepted Answer

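Google will block you... as it will be apparent that you aren't a browser. Namely, they will detect the same request signature occurring too frequently compared with reasonable human activity.

You can:

Route your requests through Tor (see "How to make urllib2 requests through Tor in Python?")
Run the code on your university computers (might not help)
Use the Google Scholar API, which might cost you money and won't give you the full feature set that a regular human user sees.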
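Since the detection described above is based on how often the same kind of request arrives, spacing requests out with randomized pauses is one complementary measure. The following is only a sketch under that assumption: fetch_all is a hypothetical helper, the 5 to 15 second range is an arbitrary illustration, and nothing here guarantees that Google will not block the client.

```python
import random
import time


def fetch_all(urls, fetch, min_delay=5.0, max_delay=15.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    The sleep function is injectable so the pacing can be tested
    without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomize the gap so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Any callable that takes a URL and returns the page can be passed as fetch.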

2020 edit:

You might want to check out scholarly:

>>> from scholarly import scholarly
>>> search_query = scholarly.search_author('Marty Banks, Berkeley')
>>> print(next(search_query))
{'_filled': False,
 'affiliation': 'Professor of Vision Science, UC Berkeley',
 'citedby': 17758,
 'email': '@berkeley.edu',
 'id': 'Smr99uEAAAAJ',
 'interests': ['vision science', 'psychology', 'human factors', 'neuroscience'],
 'name': 'Martin Banks',
 'url_picture': 'https://scholar.google.com/citations?view_op=medium_photo&user=Smr99uEAAAAJ'}

score 4 · Accepted Answer

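It looks like scraping with Python and R runs into the problem that Google Scholar sees your request as a robot query because the request lacks a user agent. There is a similar question on StackExchange about downloading all PDFs linked from a web page, and the answer points the user to wget on Unix and the BeautifulSoup package in Python.

cURL also seems to be a more promising direction.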

score 2 · Accepted Answer

COPython looks correct, but here's a bit of an explanation by example...

Consider f:

def f(a,b,c=1):
    pass

f requires values for a and b no matter what. You may leave c out.

f(1,2)     #executes fine
f(a=1,b=2) #executes fine
f(1,c=1)   #TypeError: f() takes at least 2 arguments (2 given)

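The fact that you're being blocked by Google is probably due to the user-agent setting in your headers... I'm not familiar with R, but I can give you the general algorithm for fixing this:

Visit the URL with a normal browser (Firefox or whatever) while monitoring the HTTP traffic (I like Wireshark)
Take note of all the headers sent in the relevant HTTP request
Run your script and note its headers as well
Spot the difference
Set your R script to use the headers you saw when examining the browser traffic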
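The "spot the difference" step can be sketched in Python as a comparison of two header dictionaries. This is an illustrative helper, not part of the answer: diff_headers is a hypothetical name, and the header values in the usage note are made-up samples rather than captured traffic.

```python
def diff_headers(browser_headers, script_headers):
    """Return {name: (browser_value, script_value)} for headers that differ.

    Header names are compared case-insensitively; a header missing on
    one side is reported with None for that side.
    """
    browser = {name.lower(): value for name, value in browser_headers.items()}
    script = {name.lower(): value for name, value in script_headers.items()}
    return {
        name: (browser.get(name), script.get(name))
        for name in sorted(browser.keys() | script.keys())
        if browser.get(name) != script.get(name)
    }
```

For example, comparing {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US"} with {"user-agent": "Python-urllib/3.11"} reports both accept-language (missing from the script) and user-agent (different values), which are the headers the script would then need to set.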

score 1 · Accepted Answer

This is the call signature of query()...

def query(searchstr, outformat, allresults=False)

So you need to specify at least a searchstr and an outformat; allresults is an optional flag/argument.
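To see the effect without touching the network, here is a small sketch that uses a stand-in function with the same signature. The stand-in is hypothetical and performs no search; it only echoes its arguments. The value 4 for outformat is taken from the question, which describes it as the BibTeX format.

```python
def query(searchstr, outformat, allresults=False):
    """Hypothetical stand-in with the same signature; performs no search."""
    return (searchstr, outformat, allresults)


# Omitting outformat fails, even though two arguments were passed.
try:
    query("my query", allresults=True)
except TypeError as error:
    missing_outformat = str(error)

# Supplying both required arguments works; allresults stays optional.
ok = query("my query", 4, allresults=True)
```

Under Python 3 the error message names the missing parameter (outformat) instead of using the "takes at least 2 arguments" wording quoted in the question, which comes from Python 2.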

score 0 · Accepted Answer

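You might want to use Greasemonkey for this task. The advantage is that Google will fail to detect you as a bot if you also keep the request frequency down. You can also watch the script working in your browser window.

You can learn to code it yourself or use a script from one of these sources.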

score 0 · Accepted Answer

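An ideal scenario is when you have good proxies, ideally residential ones, which let you choose a specific location (country, city, or mobile carrier), together with a CAPTCHA solving service.

As an alternative solution, you can use the Google Scholar API from SerpApi.

It's a paid API with a free plan that bypasses blocks from Google via proxies and CAPTCHA solving, can scale to enterprise level, and spares the end user from creating a parser from scratch and maintaining it over time as the HTML changes.

It also supports cite, profile, and author results.

Example code to integrate and parse organic results: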

import json

from serpapi import GoogleScholarSearch

params = {
    "api_key": "Your SerpAPi API KEY",
    "engine": "google_scholar",
    "q": "biology",
    "hl": "en"
}

search = GoogleScholarSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(json.dumps(result, indent=2))

# first organic results output:
'''
{
  "position": 0,
  "title": "The biology of mycorrhiza.",
  "result_id": "6zRLFbcxtREJ",
  "link": "https://www.cabdirect.org/cabdirect/abstract/19690600367",
  "snippet": "In the second, revised and extended, edition of this work [cf. FA 20 No. 4264], two new chapters have been added (on carbohydrate physiology physiology Subject Category \u2026",
  "publication_info": {
    "summary": "JL Harley - The biology of mycorrhiza., 1969 - cabdirect.org"
  },
  "inline_links": {
    "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=6zRLFbcxtREJ",
    "cited_by": {
      "total": 704,
      "link": "https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=5,50&sciodt=0,50&hl=en",
      "cites_id": "1275980731835430123",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=5%2C50&cites=1275980731835430123&engine=google_scholar&hl=en"
    },
    "related_pages_link": "https://scholar.google.com/scholar?q=related:6zRLFbcxtREJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,50",
    "versions": {
      "total": 4,
      "link": "https://scholar.google.com/scholar?cluster=1275980731835430123&hl=en&as_sdt=0,50",
      "cluster_id": "1275980731835430123",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C50&cluster=1275980731835430123&engine=google_scholar&hl=en"
    },
    "cached_page_link": "https://scholar.googleusercontent.com/scholar?q=cache:6zRLFbcxtREJ:scholar.google.com/+biology&hl=en&as_sdt=0,50"
  }
}
... other results
'''

My SerpApi blog post also has a dedicated walkthrough, "Scrape historic Google Scholar results using Python".

Disclaimer: I work for SerpApi.
