
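I'm trying to extract the text snippets that Google Scholar returns for a particular query. By text snippet I mean the text below the title (in black letters). Currently I'm trying to extract it from the HTML file using Python, but it contains a lot of extra text, such as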

/div><div class="gs_fl"...ETC。

Is there a simple way, or some code, that would help me get the text without this redundant markup?


2 Answers


You need an HTML parser:

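import lxml.html

# `html` is the page source as a string
doc = lxml.html.fromstring(html)
# xpath() returns a list of elements, so call text_content() on each match;
# replace "gs_fl" with the class of the element whose text you want
texts = [div.text_content() for div in doc.xpath('//div[@class="gs_fl"]')]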

You can install lxml with "pip install lxml", but you'll need to build its dependencies, and the details vary depending on your platform.
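If building lxml is a hassle on your platform, the same idea can be sketched with the standard library's html.parser instead. This is only a rough sketch, not a replacement for a real parser: the class name "gs_rs" is assumed to be the snippet container (it is the selector used elsewhere in this thread), `SnippetParser` and `extract_snippets` are made-up names, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SnippetParser(HTMLParser):
    """Collect the text inside every element carrying a given class."""

    def __init__(self, target_class="gs_rs"):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0        # > 0 while inside a matching element
        self.parts = []       # text pieces of the snippet being read
        self.snippets = []    # finished snippets

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # nested tag inside the snippet
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1    # entering a matching element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:   # leaving the matching element
            # Join the pieces and collapse runs of whitespace
            self.snippets.append(" ".join("".join(self.parts).split()))
            self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_snippets(html, target_class="gs_rs"):
    parser = SnippetParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.snippets


# Invented sample markup, shaped like a single search result.
sample = (
    '<div class="gs_ri"><h3 class="gs_rt">Some title</h3>'
    '<div class="gs_rs">Country of <b>origin</b> effects<br>\n on buying</div>'
    '<div class="gs_fl"><a href="#">Cited by 3</a></div></div>'
)
print(extract_snippets(sample))
```

The depth counter assumes reasonably well-formed markup; for messy real-world pages, lxml is the more forgiving choice.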

answered 2013-04-02T16:18:02.130

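This is an old question, but probably still a relevant one. Use SelectorGadget to grab CSS selectors easily. Make sure you're using a proxy; otherwise Google may block your requests, even if you go through selenium.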

Code and full example in the online IDE:

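from bs4 import BeautifulSoup
import requests, os

# Browser-like User-Agent so the request is less likely to be rejected outright
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# The URL below is https, so the proxy has to be set for that scheme as well
# (HTTPS_PROXY is an assumed environment variable name)
proxies = {
  'http': os.getenv('HTTP_PROXY'),
  'https': os.getenv('HTTPS_PROXY'),
}

html = requests.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=samsung&oq=', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser needs the lxml package installed

# Each result sits in a .gs_ri container; the snippet text is in .gs_rs
for result in soup.select('.gs_ri'):
  snippet = result.select_one('.gs_rs')
  if snippet is None:  # skip results without a snippet
    continue
  print(f"Snippet: {snippet.text}")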

Part of the output:

Snippet: Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …

Alternatively, you can use the Google Scholar Organic Search Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

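Essentially, it does the same thing as the script above, except that you don't have to figure out how to solve CAPTCHAs or find a good proxy (or proxies).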

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar",
  "q": "samsung",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  print(f"Snippet: {result['snippet']}")

Part of the output:

Snippet: Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …
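The loop above indexes `results['organic_results']` and `result['snippet']` directly, which raises a KeyError if either key is absent. A minimal defensive sketch, assuming that either key may be missing from a response (that is an assumption, not something confirmed here); `collect_snippets` is a made-up helper name and the sample dictionary is invented, not real API output:

```python
def collect_snippets(results):
    """Return the snippet of every organic result that has one."""
    organic = results.get("organic_results") or []
    return [item["snippet"] for item in organic if item.get("snippet")]


# Invented response data for illustration; real responses carry more fields.
sample_results = {
    "organic_results": [
        {"title": "Paper A", "snippet": "First snippet …"},
        {"title": "Paper B"},  # no snippet on this result
    ]
}

for snippet in collect_snippets(sample_results):
    print(f"Snippet: {snippet}")
```

With the script above, this would be called as `collect_snippets(search.get_dict())`.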

Disclaimer: I work for SerpApi.

answered 2021-05-23T13:19:10.637