
Given a typical keyword search in Google Scholar (see screenshot), I want to get, for each publication appearing on the page, a dictionary containing its title and url (e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}).

[screenshot: Google Scholar results page for the example query]

To retrieve the results page from Google Scholar, I am using the following code:

from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup

class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page

This code correctly returns the results page, in (very ugly) HTML format. However, I have not been able to progress beyond this point, as I could not figure out how to use BeautifulSoup (with which I am not very familiar) to parse the results page and retrieve the data.

Notice that the issue is with parsing and extracting data from the results page, not with Google Scholar itself, since the results page is correctly retrieved by the above code.

Could anyone please give a few hints? Thanks in advance!


2 Answers


Inspecting the page content shows that each search result is wrapped in an h3 tag with the attribute class="gs_rt". You can use BeautifulSoup to extract just those tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair to a dictionary, and collect them in a list of dictionaries:

import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    if entry.a is not None:  # citation-only results ([CITATION]) have no link
        results.append({"title": entry.a.text, "url": entry.a['href']})

Output:

[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
  'url': 'https://www.nature.com/articles/338427a0'},
 {'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
  'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
 ...]

Note: I used requests instead of urllib, because my urllib (Python 3) doesn't include FancyURLopener. But the BeautifulSoup syntax is the same regardless of how you fetch the page content.
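If you want to check the extraction logic without hitting Google Scholar at all, you can run the same BeautifulSoup code against a small static snippet. The HTML below is a hand-written, minimal stand-in for the real results markup (which has many more wrappers and attributes):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a Scholar results page: two h3.gs_rt entries,
# each with an <a> holding the title text and the href.
html = '''
<h3 class="gs_rt"><a href="https://www.nature.com/articles/338427a0">Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells</a></h3>
<h3 class="gs_rt"><a href="http://www.plantphysiol.org/content/126/4/1438.short">Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba</a></h3>
'''

page = BeautifulSoup(html, 'html.parser')  # html.parser avoids the lxml dependency
results = [{"title": h3.a.get_text(strip=True), "url": h3.a['href']}
           for h3 in page.find_all("h3", class_="gs_rt")]

print(results[0]['url'])  # https://www.nature.com/articles/338427a0
```

This makes it easy to verify the selectors before dealing with the live page.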

Answered 2018-05-27T19:57:29.463

At the time of writing, andrew_reece's answer doesn't work: even though the h3 tags with the correct class are in the page source, the request still fails, e.g. you get a CAPTCHA because Google detects your script as automated. Print the response to see the message.

I got this after sending too many requests:

The block will expire shortly after those requests stop.
Sometimes you may be asked to solve the CAPTCHA
if you are using advanced terms that robots are known to use, 
or sending requests very quickly.

The first thing you can do is add proxies to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
  'http': os.getenv('HTTP_PROXY') # Or just type your proxy here without os.getenv()
}

The request code would then look like this:

html = requests.get('google scholar link', headers=headers, proxies=proxies).text
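The headers variable above is not defined in the snippet. A minimal sketch of what it could contain is below; the particular User-Agent string is only an example, any realistic browser User-Agent will do:

```python
import os

# A browser-like User-Agent makes the request look less like a bot;
# this particular string is just an example value.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/103.0.0.0 Safari/537.36'
}

proxies = {
    'http': os.getenv('HTTP_PROXY'),    # or hard-code your proxy URL here
    'https': os.getenv('HTTPS_PROXY'),
}

# With those in place (and `import requests`), the request would be:
# html = requests.get('https://scholar.google.com/scholar?q=vicia+faba',
#                     headers=headers, proxies=proxies, timeout=10).text
```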

Alternatively, you can make it work without a proxy by using requests-HTML, selenium, or pyppeteer to render the page.

Code:

# If you'll get an empty array, this means you get a CAPTCHA. 

from requests_html import HTMLSession
import json

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

results = []

# Container where data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text

    # absolute_links is a set of URLs; next(iter(...)) pulls one out as a string
    url = next(iter(result.absolute_links))

    results.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results, indent = 2, ensure_ascii = False))

Partial output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
  }
]

Essentially, you can do the same thing with the Google Scholar API from SerpApi. You don't have to render the page or use browser automation to get data from Google Scholar: you get instant JSON output, which is faster than selenium or requests-html, and you don't have to figure out how to bypass Google's blocking.

It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_scholar",
  "q": "vicia faba",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

results_data = []

for result in results['organic_results']:
    title = result['title']
    url = result['link']

    results_data.append({
        'title': title,
        'url': url,
    })
    
print(json.dumps(results_data, indent = 2, ensure_ascii = False))

Partial output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
  },
]

Disclaimer: I work for SerpApi.

Answered 2021-05-16T05:32:15.120