web-scraping - Free API/library that allows processing, iterating over, and querying scientific articles (e.g. Google Scholar)?

Question

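I am trying to find a feasible way to iterate over all scientific papers on Google Scholar (or any other scientific library). I don't need the paper content, only the title, authors, citations, and abstract.

I am looking for some kind of library/API that lets me iterate over and process these papers, and that also has strong querying capabilities.

So far the only one I have found is scholarly. Its querying seems pretty decent, but I don't see any option for iterating through everything.

Is there any other web-scraping tool that would let me do this?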

score 0 · Accepted Answer

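There is a Google Scholar API from SerpApi that supports organic, cite, profile, and author results. It scales to enterprise level and bypasses blocks from Google, so you don't have to figure that out yourself.

Example code for extracting organic results, with a link to a full example in an online IDE: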

# to scrape profile results, author:
# https://replit.com/@DimitryZub1/Scrape-Google-Scholar-Profile-Results-from-all-Pages#main.py

import json
from serpapi import GoogleScholarSearch

params = {
    "api_key": "Your SerpApi API key",
    "engine": "google_scholar",
    "q": "biology",                    # search query
    "hl": "en"                         # language
}

search = GoogleScholarSearch(params)   # where extraction happens
results = search.get_dict()            # JSON -> Python dict

for result in results["organic_results"]:
    print(json.dumps(result, indent=2))

# part of the output:
'''
{
  "position": 0,
  "title": "The biology of mycorrhiza.",
  "result_id": "6zRLFbcxtREJ",
  "link": "https://www.cabdirect.org/cabdirect/abstract/19690600367",
  "snippet": "In the second, revised and extended, edition of this work [cf. FA 20 No. 4264], two new chapters have been added (on carbohydrate physiology physiology Subject Category \u2026",
  "publication_info": {
    "summary": "JL Harley - The biology of mycorrhiza., 1969 - cabdirect.org"
  },
  "inline_links": {
    "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=6zRLFbcxtREJ",
    "cited_by": {
      "total": 704,
      "link": "https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en",
      "cites_id": "1275980731835430123",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cites=1275980731835430123&engine=google_scholar&hl=en"
    },
    "related_pages_link": "https://scholar.google.com/scholar?q=related:6zRLFbcxtREJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,5",
    "versions": {
      "total": 4,
      "link": "https://scholar.google.com/scholar?cluster=1275980731835430123&hl=en&as_sdt=0,5",
      "cluster_id": "1275980731835430123",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C5&cluster=1275980731835430123&engine=google_scholar&hl=en"
    },
    "cached_page_link": "https://scholar.googleusercontent.com/scholar?q=cache:6zRLFbcxtREJ:scholar.google.com/+biology&hl=en&as_sdt=0,5"
  }
}
... other results
'''
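
The question asks only for title, authors, citation count, and abstract. Below is a minimal sketch of reducing one entry of `organic_results` to those fields, assuming the key layout shown in the sample output above. `summarize_result` is a hypothetical helper, not part of the SerpApi client, and the `snippet` field is a truncated excerpt rather than a full abstract.

```python
def summarize_result(result):
    """Pick the fields of interest out of one organic result dict.

    Assumes the key layout from the sample output; missing keys fall
    back to None instead of raising.
    """
    cited_by = result.get("inline_links", {}).get("cited_by", {})
    return {
        "title": result.get("title"),
        # Authors, venue and year arrive together in one summary string.
        "publication_summary": result.get("publication_info", {}).get("summary"),
        "cited_by_total": cited_by.get("total"),
        "snippet": result.get("snippet"),
    }
```

Applied to the sample result above, this would keep the title, the "JL Harley - ..." summary string, the citation total, and the snippet, and drop the link fields.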

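If you want to scrape data from all available pages, or all publications of a particular author, I have dedicated blog posts at SerpApi: "Scrape historic Google Scholar results to CSV, SQLite" and "Scrape all Google Scholar Profile, Author Results to CSV".

Disclaimer: I work for SerpApi.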
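
A rough sketch of what iterating over all result pages could look like. It assumes pagination works through a `start` offset parameter in steps of ten, and it stops at the first page that has no `organic_results`. `fetch_page` is a hypothetical callable you would supply, for example a wrapper around `GoogleScholarSearch(params).get_dict()` from the snippet above. Check the SerpApi documentation for the actual pagination parameters and limits before relying on this.

```python
def iterate_results(fetch_page, base_params, page_size=10, max_pages=100):
    """Yield organic results page by page until a page comes back empty.

    fetch_page: hypothetical callable that takes a params dict and returns
                the response as a dict (e.g. wrapping search.get_dict()).
    max_pages:  safety cap so a misbehaving fetcher cannot loop forever.
    """
    for page in range(max_pages):
        # "start" as the result offset is an assumption; verify it in the docs.
        params = dict(base_params, start=page * page_size)
        response = fetch_page(params)
        results = response.get("organic_results", [])
        if not results:
            break
        yield from results


# Possible usage with the client from the snippet above (not executed here):
# all_results = list(
#     iterate_results(lambda p: GoogleScholarSearch(p).get_dict(), params)
# )
```

Passing the fetcher in as an argument keeps the loop independent of any particular client and makes it easy to try out with canned responses.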

score 0 · Accepted Answer

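Without knowing your specific purpose, it is hard to give a good answer.

However, the go-to place for scientific metadata (e.g. title, authors, citations) would be CrossRef's API. It is free to use.

I don't know how you would determine your sample, but you could, for example, use a journal's ISSN to get metadata on the papers in that journal, or use a publication's DOI to get metadata on that specific paper.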
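
A minimal sketch of both lookups against the public CrossRef REST API, using only the Python standard library. The `/journals/{issn}/works` and `/works/{doi}` routes and the cursor-based paging (`cursor=*`, then the `next-cursor` value from each response) reflect my understanding of the CrossRef documentation; the function names are made up for illustration, and abstracts are only present for records where the publisher deposited them.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.crossref.org"


def fetch_json(url):
    """GET a URL and decode the JSON body (performs a network request)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def work_url(doi):
    """URL for the metadata of a single paper, identified by its DOI."""
    return f"{BASE}/works/{urllib.parse.quote(doi)}"


def journal_works_url(issn, cursor="*", rows=100):
    """URL for one page of works published in the journal with this ISSN."""
    query = urllib.parse.urlencode({"rows": rows, "cursor": cursor})
    return f"{BASE}/journals/{issn}/works?{query}"


def iterate_journal(issn, fetch=fetch_json):
    """Yield every work record of a journal, following next-cursor."""
    cursor = "*"
    while True:
        message = fetch(journal_works_url(issn, cursor))["message"]
        items = message.get("items", [])
        if not items:
            break
        yield from items
        cursor = message.get("next-cursor")
        if not cursor:
            break


def summarize_work(work):
    """Reduce a CrossRef work record to title, authors, citations, abstract."""
    authors = [
        " ".join(filter(None, [a.get("given"), a.get("family")]))
        for a in work.get("author", [])
    ]
    return {
        "title": (work.get("title") or [None])[0],
        "authors": authors,
        "cited_by": work.get("is-referenced-by-count"),
        "abstract": work.get("abstract"),
    }
```

For a single paper, `summarize_work(fetch_json(work_url(doi))["message"])` would give the reduced record; for a whole journal, map `summarize_work` over `iterate_journal(issn)`.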
