python - Crawling Google Scholar

Question

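As part of a research study, I am trying to collect information on a large number of scholarly articles, on the order of thousands. Since Google Scholar has no API, I am trying to scrape/crawl Scholar. I know this is technically against the EULA, but I am trying to be polite and reasonable about it. I understand that Google disallows bots in order to keep traffic within reasonable limits. I started with a test batch of about 50,000 requests spaced 1 second apart, and was blocked after roughly the first 100. I then tried several other strategies, including:

extending the pauses to about 20 seconds and adding some random noise to them
making the pauses log-normally distributed (so most pauses are on the order of seconds, but every now and then there is a pause of several minutes or more)
taking long pauses (several hours) between blocks of requests (~100).

I doubt that at this point my script adds any appreciable traffic beyond what a human would generate. Yet I always get blocked after about 100-200 requests. Does anyone know a good strategy to overcome this? (I don't care if it takes weeks, as long as it is automated.) Also, does anyone have experience dealing with Google directly and asking for permission to do something like this (for research, etc.)? Is it worth writing to them to explain what I am trying to do and how, and see whether I can get permission for my project? How would I go about contacting them? Thanks!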
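For readers who want to reproduce the pacing strategies listed in the question, here is a minimal sketch. The function names, the 5-second median, the sigma value and the 3-hour batch rest are all assumptions chosen for illustration, not values from the original script, and nothing here is claimed to avoid blocking.

```python
import math
import random
import time


def lognormal_pause(median_s=5.0, sigma=1.2, cap_s=3600.0, rng=random):
    """Draw one pause length in seconds from a log-normal distribution."""
    # lognormvariate(mu, sigma) has median exp(mu), so mu = ln(median_s).
    pause = rng.lognormvariate(math.log(median_s), sigma)
    return min(pause, cap_s)  # cap the rare extreme draw


def paced(items, batch_size=100, batch_rest_s=3 * 3600, sleep=time.sleep, rng=random):
    """Yield items one by one, pausing between them and resting longer after each batch."""
    for i, item in enumerate(items):
        if i:  # no pause before the very first item
            if i % batch_size == 0:
                sleep(batch_rest_s)  # long rest at a batch boundary
            else:
                sleep(lognormal_pause(rng=rng))  # short, occasionally long pause
        yield item
```

Usage would be `for url in paced(urls): ...`, with the actual request made inside the loop. Passing `sleep` as a parameter makes it possible to test the schedule without waiting.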

score 2 · Accepted Answer

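I have not tested this, but I am still fairly sure one of the following approaches would do the trick:

Easy, but with a small chance of success:

Delete all cookies from the site in question after every rand(0,100) requests,
then change your user agent, accepted language, etc. and repeat.

More work, but the result is a sturdier spider:

Send your requests through Tor, other proxies, mobile networks, etc. to mask your IP (and also apply the first suggestion throughout)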
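The two suggestions above can be sketched with the Python standard library as follows. This is an untested-in-the-wild illustration, not a recipe known to work against Google Scholar: the user-agent strings, language values and proxy address are hypothetical placeholders, and `new_identity` and `with_rotating_identity` are names made up for this example. Note that `urllib` only speaks to HTTP proxies, so Tor's SOCKS port would need an HTTP front-end or a SOCKS-capable client.

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical pools for illustration; replace them with values you are entitled to use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]
LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "de-DE,de;q=0.9,en;q=0.5"]
PROXIES = [None, "http://127.0.0.1:8118"]  # None = direct; the address is a placeholder


def new_identity(rng=random):
    """Return (opener, budget): a fresh opener with an empty cookie jar and new headers,
    plus the number of requests to send before rotating again."""
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]  # empty jar = cookies dropped
    proxy = rng.choice(PROXIES)
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [
        ("User-Agent", rng.choice(USER_AGENTS)),
        ("Accept-Language", rng.choice(LANGUAGES)),
    ]
    budget = rng.randint(1, 100)  # the answer says rand(0,100); 1 avoids an empty budget
    return opener, budget


def with_rotating_identity(urls, rng=random):
    """Yield (url, opener) pairs, switching identity whenever the budget runs out."""
    opener, budget = new_identity(rng)
    for url in urls:
        if budget == 0:
            opener, budget = new_identity(rng)
        budget -= 1
        yield url, opener
```

A caller would then run `opener.open(url)` for each yielded pair, ideally combined with pauses between requests.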

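Update regarding Selenium: I missed the fact that you are using Selenium and took for granted that it was just some kind of modern programming language (I know Selenium can be driven from most widely used languages, but it can also be used as a sort of browser plugin that demands very little programming skill).

Since I then presume your coding skills aren't (or weren't?) mind-blowing, my answer to you, and to others under the same constraints when using Selenium, is to learn a simple scripting language (PowerShell?!) or JavaScript (since it's the web you're on ;-)) and take it from there.

If smooth automated scraping were as simple as a browser plugin, the web would have to be a much messier, more obfuscated and more credential-demanding place.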

score 0 · Accepted Answer

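The ideal setup is to have reliable proxies and a CAPTCHA-solving service: on Google Scholar your IP may be blocked or rate-limited, or a CAPTCHA may be raised instead, so having both services should get you the results you want.

Alternatively, if you don't want to spend time finding reliable services of that kind, you can use the Google Scholar API from SerpApi.

It is a paid API with a free plan. It handles all of the problems you might otherwise encounter on its backend, so the user doesn't need to think about them, maintain a solution, or build one from scratch.

Example code to scrape Google Scholar organic results: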

import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl


def organic_results():
    print("extracting organic results..")

    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": "minecraft redstone system structure characteristics strength",  # search query
        "hl": "en",        # language
        "as_ylo": "2017",  # from 2017
        "as_yhi": "2021",  # to 2021
        "start": "0"       # first page
    }

    search = GoogleSearch(params)

    organic_results_data = []

    organic_results_is_present = True
    while organic_results_is_present:
        results = search.get_dict()

        print(f"Currently extracting page #{results['serpapi_pagination']['current']}..")

        for result in results["organic_results"]:
            position = result["position"]
            title = result["title"]
            publication_info_summary = result["publication_info"]["summary"]
            result_id = result["result_id"]
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")
  
            # "resources" and "inline_links" are optional, so read them defensively
            # instead of relying on bare "except:" clauses
            resources = result.get("resources") or [{}]
            file_title = resources[0].get("title")
            file_link = resources[0].get("link")
            file_format = resources[0].get("file_format")

            inline_links = result.get("inline_links", {})
            cited_by = inline_links.get("cited_by", {})
            versions = inline_links.get("versions", {})

            # missing values become None (not an empty dict)
            cited_by_count = int(cited_by["total"]) if "total" in cited_by else None
            cited_by_id = cited_by.get("cites_id")
            cited_by_link = cited_by.get("link")

            total_versions = int(versions["total"]) if "total" in versions else None
            all_versions_link = versions.get("link")
            all_versions_id = versions.get("cluster_id")
  
            organic_results_data.append({
              "page_number": results["serpapi_pagination"]["current"],
              "position": position + 1,
              "result_type": result_type,
              "title": title,
              "link": link,
              "result_id": result_id,
              "publication_info_summary": publication_info_summary,
              "snippet": snippet,
              "cited_by_count": cited_by_count,
              "cited_by_link": cited_by_link,
              "cited_by_id": cited_by_id,
              "total_versions": total_versions,
              "all_versions_link": all_versions_link,
              "all_versions_id": all_versions_id,
              "file_format": file_format,
              "file_title": file_title,
              "file_link": file_link,
            })
            
        # checked once per page, after all of its results have been collected:
        # if a next page is present -> split its URL into parts and pass them to GoogleSearch() as a dictionary
        # if there is no next page -> exit the while loop
        if "next" in results["serpapi_pagination"]:
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            organic_results_is_present = False

    return organic_results_data


# part of the output:
'''
[
  {
    "page_number": 3,
    "position": 7,
    "result_type": "Pdf",
    "title": "A methodology proposal for MMORPG content expansion analysis",
    "link": "https://www.sbgames.org/sbgames2017/papers/ArtesDesignFull/175051.pdf",
    "result_id": "wKa3TpX-gn8J",
    "publication_info_summary": "AMM Santos, A Franco, JGR Maia, F Gomes… - … Simpósio Brasileiro de …, 2017 - sbgames.org",
    "snippet": "… Given the problem statement, this work aims to analyze MMORPGs to identify strengths and … Minecraft players built computers 5, calculators and even printers 6 on top of the “redstone” …",
    "cited_by_count": 4,
    "cited_by_link": "https://scholar.google.com/scholar?cites=9188186107013473984&as_sdt=8000005&sciodt=0,19&hl=en",
    "cited_by_id": "9188186107013473984",
    "total_versions": 2,
    "all_versions_link": "https://scholar.google.com/scholar?cluster=9188186107013473984&hl=en&as_sdt=0,19&as_ylo=2017&as_yhi=2021",
    "all_versions_id": "9188186107013473984",
    "file_format": "PDF",
    "file_title": "sbgames.org",
    "file_link": "https://www.sbgames.org/sbgames2017/papers/ArtesDesignFull/175051.pdf"
  }, ... other results
]
'''
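The pagination step in the code above turns the "next" URL back into a parameter dictionary. A small standalone sketch of that step follows; the URL is a made-up example shaped like a typical "next" link, not captured from a real response.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical "next" link, for illustration only.
next_url = (
    "https://serpapi.com/search.json"
    "?engine=google_scholar&q=minecraft+redstone&hl=en&as_ylo=2017&as_yhi=2021&start=10"
)

# urlsplit() isolates the query string; parse_qsl() decodes it into (key, value) pairs.
params = dict(parse_qsl(urlsplit(next_url).query))

print(params["q"])      # prints: minecraft redstone
print(params["start"])  # prints: 10
```

Updating the existing search parameters with this dictionary leaves the query unchanged and only moves "start" forward to the next page.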

Also, if you want to scrape profile and author results, there is a dedicated blog post, "Scrape all Google Scholar Profile, Author Results to CSV with Python and SerpApi".
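For the organic results collected by organic_results(), a CSV export can be sketched with the standard library alone. This is an illustrative helper, not code from the blog post; `write_results_csv` is a name made up for this example, and the sample rows only reuse a few fields from the output shown earlier.

```python
import csv
import io


def write_results_csv(rows, fileobj):
    """Write a list of result dictionaries to fileobj as CSV; return the number of rows."""
    if not rows:
        return 0
    # Union of keys in first-seen order, so rows that lack a key still fit the header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty cells
    return len(rows)


sample = [
    {"page_number": 3, "position": 7, "title": "A methodology proposal for MMORPG content expansion analysis", "cited_by_count": 4},
    {"page_number": 3, "position": 8, "title": "Untitled example row", "cited_by_count": None},
]
buffer = io.StringIO()
written = write_results_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open("results.csv", "w", newline="", encoding="utf-8")` and pass the file object in place of the buffer.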

Disclaimer: I work for SerpApi.
