
I have a list of about 20,000 article titles, and I want to get their citation counts from Google Scholar. I'm new to the BeautifulSoup library. I have this code:

import requests
from bs4 import BeautifulSoup

query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer- wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']

url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

But it only returns titles and URLs. I don't know how to get the citation info from another tag. Please help.


3 Answers


You need to loop over the list. You can use a Session for efficiency. The code below is for bs4 4.7.1, which supports the :contains pseudo-class for finding the citation count. It looks like you can drop the h3 type selector from the CSS selector and just use the class before the a, i.e. .gs_rt a. If you don't have 4.7.1, you can use [title=Cite] + a to select the citation count instead.

import requests
from bs4 import BeautifulSoup as bs

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']

with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml') # or 'html.parser'
        title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
        citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        print(title, link, citations) 
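
Side note: newer versions of Soup Sieve (the selector engine bundled with Beautiful Soup) deprecate :contains in favor of :-soup-contains. A minimal sketch of the same lookup, assuming Soup Sieve 2.1+:

# Same lookup as above, using the newer, non-deprecated pseudo-class name
citations = soup.select_one('a:-soup-contains("Cited by")')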

An alternative for < 4.7.1:

with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml') # or 'html.parser'
        title = soup.select_one('.gs_rt a')
        if title is None:
            title = 'No title'
            link = 'No link'
        else:  
            link = title['href']
            title = title.text
        citations = soup.select_one('[title=Cite] + a')
        if citations is None:
            citations = 'No citation count'
        else:
            citations = citations.text
        print(title, link, citations)

The bottom version was rewritten thanks to comments from @facelessuser. The top version is left for comparison:

It is probably more efficient not to call select_one twice in a single-line if statement. While pattern building is cached, the returned tag is not cached. I personally would set the variable to whatever select_one returns, and then, only if the variable is None, change it to 'No link' or 'No title' etc. It isn't as compact, but it will be more efficient.

[...] always check if tag is None:, and not just if tag:. With selectors it's not a big deal, since they will only return tags, but if you ever do something like for x in tag.descendants: you get text nodes (strings) as well as tags, and an empty string will evaluate as false even though it is a valid node. In that case it is safest to check for None.
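
A minimal sketch of that pitfall, using a throwaway HTML string purely for illustration:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><span></span>text</div>', 'html.parser')
for node in soup.div.descendants:
    if node:                  # the empty <span></span> is falsy, so it is skipped here
        print('truthy:', repr(node))
    if node is not None:      # every real node passes: both <span></span> and 'text'
        print('not None:', repr(node))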

Answered 2019-05-20T12:36:46.893

Instead of finding all <h3> tags, I suggest you search for the tag that encloses both the <h3> and the citation (inside <div class="gs_rs">), i.e. find all <div class="gs_ri"> tags.

From those tags, you should then be able to get everything you need:

import requests
from bs4 import BeautifulSoup

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer- wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']
results = []
for query in queries:  # the query must be a single title, so loop over the list
    url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
    content = requests.get(url).text
    page = BeautifulSoup(content, 'lxml')
    for entry in page.find_all("div", attrs={"class": "gs_ri"}): # tag containing both h3 and citation
        results.append({"title": entry.h3.a.text, "url": entry.a['href'], "citation": entry.find("div", attrs={"class": "gs_rs"}).text}) # .gs_rs holds the snippet shown under the title
Answered 2019-05-20T12:38:50.620

Make sure you're using a user-agent, because the default requests user-agent is python-requests and Google may block your request; you would then receive different HTML, containing some kind of error, that doesn't include the selectors you're trying to select. Check what your user-agent is.
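
A quick way to check, since httpbin.org simply echoes the request headers back:

import requests

# Without a headers argument this prints something like 'python-requests/2.x.y'
print(requests.get('https://httpbin.org/headers').json()['headers']['User-Agent'])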

It might also be a good idea to rotate user-agents while making requests.
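
A minimal sketch of one way to do that; the user-agent strings below are just illustrative examples:

import random
import requests

# A small, illustrative pool of desktop user-agent strings to rotate through
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

headers = {'User-agent': random.choice(user_agents)}  # pick one at random per request
html = requests.get('https://scholar.google.com/scholar',
                    headers=headers, params={'q': 'some title', 'hl': 'en'}).text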

Code and full example in the online IDE that scrapes more:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
'Uncoupling conformational states from activity in an allosteric enzyme',
'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
'We Know Who Likes Us, but Not Who Competes Against Us']

for query in queries:
  params = {
    "q": query,
    "hl": "en",
  }

  html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
  soup = BeautifulSoup(html, 'lxml')

  # Container where all needed data is located
  for result in soup.select('.gs_ri'):
    title = result.select_one('.gs_rt').text
    title_link = result.select_one('.gs_rt a')['href']
    cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']
    cited_by_count = result.select_one('#gs_res_ccl_mid .gs_nph+ a').text.split(' ')[2]

    print(f"{title}\n{title_link}\n{cited_by}\n{cited_by_count}\n")

Alternatively, you can achieve the same thing by using the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to iterate over structured JSON and grab the data you want, rather than figure out why certain things don't work as they should.

Code to integrate:

from serpapi import GoogleSearch
import os
import json  # needed for json.dumps below

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
'Uncoupling conformational states from activity in an allosteric enzyme',
'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
'We Know Who Likes Us, but Not Who Competes Against Us']

for query in queries:
  params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar",
    "q": query,
  }

  search = GoogleSearch(params)
  results = search.get_dict()

  data = []

  for result in results['organic_results']:
    data.append({
      'title': result['title'],
      'link': result['link'],
      'publication_info': result['publication_info']['summary'],
      'snippet': result['snippet'],
      'cited_by': result['inline_links']['cited_by']['link'],
      'related_versions': result['inline_links']['related_pages_link'],
    })

  print(json.dumps(data, indent=2, ensure_ascii=False))

P.S. - I wrote a blog post about how to scrape pretty much everything on Google Scholar, with visual representations.

Disclaimer: I work for SerpApi.

Answered 2021-09-09T05:53:48.860