python - Identifying the problem with retrieving href from Google Scholar

Question

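I am unable to scrape the links and article names from Google Scholar. I'm not sure whether the problem is in my code or in the XPath I'm using to retrieve the data, or possibly both?

I have spent the past few hours trying to debug this and consulting other Stack Overflow questions, without success.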

import scrapy
from scrapyproj.items import ScrapyProjItem

class scholarScrape(scrapy.Spider):

    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls=["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self,response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/@href").extract()
        item['name'] = item.xpath("//div[@class='gs_rt']/h3").extract()
        yield item

The error message I get says "AttributeError: xpath", so I think the problem lies in the path I am using to try to retrieve the data, but I could also be mistaken?

score 1 · Accepted Answer

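Adding my comment as an answer, since it solved the problem:

The problem is with scrapyproj.items.ScrapyProjItem objects: they do not have an xpath attribute. Is this an official scrapy class? I think you meant to call xpath on response: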

item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/@href").extract()
item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()

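Also, the first path expression needs an @ in front of class (without it, class is read as a child element name rather than an attribute) and a set of quotes around the attribute value "gs_rt":

item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()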

Other than that, the XPath expressions look fine.
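To see why the attribute test matters, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of scrapy, so it runs without a network request. The markup in SAMPLE is a hypothetical, heavily simplified stand-in for a results page, not the real Google Scholar HTML, and ElementTree supports only a subset of XPath, so the href is read with .get() rather than /@href.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far more complex.
SAMPLE = """
<div>
  <h3 class="gs_rt"><a href="https://example.org/paper-1">Paper one</a></h3>
  <h3 class="gs_rt"><a href="https://example.org/paper-2">Paper two</a></h3>
  <h3 class="other"><a href="https://example.org/ignored">Ignored</a></h3>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# [class='gs_rt'] looks for a child element named <class>, so nothing matches.
without_at = root.findall(".//h3[class='gs_rt']/a")

# [@class='gs_rt'] tests the class attribute, which is what was intended.
with_at = root.findall(".//h3[@class='gs_rt']/a")

links = [a.get("href") for a in with_at]
names = [a.text for a in with_at]

print(len(without_at), links, names)
```

In a scrapy callback the same predicate would go into response.xpath(...), as in the answer above.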

score 0 · Accepted Answer

An alternative solution using bs4:

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# Container where all articles located
for article_info in soup.select('#gsc_a_b .gsc_a_t'):
  # title CSS selector
  title = article_info.select_one('.gsc_a_at').text
  # Same title CSS selector, except we're trying to get "data-href" attribute
  # Note, it will be relative link, so you need to join it with absolute link after extracting.
  title_link = article_info.select_one('.gsc_a_at')['data-href']
  print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\n')

# Part of the output:
'''
Title: Automating Gödel's Ontological Proof of God's Existence with Higher-order Automated Theorem Provers.
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&citation_for_view=m8dFEawAAAAJ:-f6ydRqryjwC
'''
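The comment in the snippet above notes that the extracted link is relative and has to be joined with the site's absolute address. The snippet does this with string concatenation; a slightly more defensive sketch uses urllib.parse.urljoin from the standard library. The names BASE_URL and absolute_link are illustrative only and are not part of the original answer.

```python
from urllib.parse import urljoin

# Assumed base address for the relative "data-href" values.
BASE_URL = "https://scholar.google.com"


def absolute_link(relative_link: str) -> str:
    """Join a relative link with the base URL.

    Links that are already absolute are returned unchanged by urljoin.
    """
    return urljoin(BASE_URL, relative_link)


example = absolute_link("/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ")
print(example)
```

Compared with plain concatenation, urljoin avoids doubled or missing slashes and does not prepend the base to a link that is already absolute.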

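Alternatively, you can do the same thing with the Google Scholar Author Articles API from SerpApi.

The main difference is that you do not have to handle things like blocks or CAPTCHAs yourself, which can still come up even if you're using selenium. It is a paid API with a free plan.

Code to integrate: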

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "9PepYk8AAAAJ",
}

search = GoogleSearch(params)
results = search.get_dict()

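for article in results['articles']:
  article_title = article['title']
  article_link = article['link']
  # Print each pair so the script produces the output shown below
  print(f'Title: {article_title}\nLink: {article_link}\n')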

# Part of the output:
'''
Title: p-GaN gate HEMTs with tungsten gate metal for high threshold voltage and low gate current
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:bUkhZ_yRbTwC
'''
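If the response might lack the 'articles' key or some entries might be missing a field, a tolerant loop can help. The sketch below assumes only the shape used in the code above (a dictionary with an 'articles' list whose entries carry 'title' and 'link'). The helper extract_articles and the sample_results dictionary are hypothetical and stand in for the value returned by search.get_dict().

```python
def extract_articles(results):
    """Return (title, link) pairs, skipping entries that lack either field."""
    pairs = []
    for article in results.get("articles", []):
        title = article.get("title")
        link = article.get("link")
        if title and link:
            pairs.append((title, link))
    return pairs


# Hypothetical stand-in for search.get_dict(); a real response has more fields.
sample_results = {
    "articles": [
        {
            "title": "Example article",
            "link": "https://scholar.google.com/citations?view_op=view_citation",
        },
        {"title": "Entry without a link"},
    ]
}

pairs = extract_articles(sample_results)
for title, link in pairs:
    print(f"Title: {title}\nLink: {link}\n")
```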

Disclaimer: I work for SerpApi.
