python - Web scraping on Google Scholar keeps returning an empty list

Question

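I am trying to scrape the web for university, but it is hard to do with Google Scholar. I have tried a lot of things, and apparently it has something to do with .json().

I want to write a function that takes brands such as Apple and Samsung as input and returns a list of titles with their respective abstracts.

Could someone please help me here? Thanks! Below is what I have written so far, along with some of the other things I tried.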

from bs4 import BeautifulSoup
import requests
import csv
import json

brand = input("Enter Technology:  ")
source = requests.get('https://scholar.google.com/scholar?0&q={0}+technology'.format(brand)).text
soup = BeautifulSoup(source, 'lxml')

#script = soup.select_one('[type="application/ld+json"]').text
#data = json.loads(script)
#soup = BeautifulSoup(data['description'], 'lxml')

headers = soup.find_all('div', class_="gs_rt")

print(headers)

score 1 · Accepted Answer

The first thing you can do is add proxies to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    'http': os.getenv('HTTP_PROXY'),  # or type your proxy URL here instead of using os.getenv()
}

The request code would then look like this:

html = requests.get('google scholar link', headers=headers, proxies=proxies).text

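Or, as a more naive approach, you can add random pauses between requests. To get around the block you can also render the page with selenium, requests-html, or pyppeteer without using proxies, but Google may still block you if you send too many requests at the same time.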
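The random pause mentioned above could be sketched roughly as follows. The function name `polite_pause` and the default bounds of 2 to 6 seconds are assumptions made for illustration, not values recommended by Google or by the libraries named here.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it.

    The `sleep` parameter is injectable so the helper can be tested
    without actually waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_pause()` after each `requests.get(...)` spaces the requests out irregularly. It lowers the request rate but does not guarantee that Google will not block you.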

'''
If you get an empty list, you most likely received a CAPTCHA.
Print the response text to see what is going on, or wait a while before sending requests again.
'''

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=samsung&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

# Container where data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first = True).text
    print(title)
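To tell an empty result caused by a block page apart from a genuinely empty search, a rough heuristic like the one below might help. The marker strings are assumptions based on wording commonly seen on Google's block pages and are not a documented interface, so print the response text first and adjust them to what you actually receive.

```python
def looks_blocked(html, markers=("unusual traffic", "not a robot")):
    """Return True if the page text contains any of the assumed block-page markers."""
    text = html.lower()
    return any(marker in text for marker in markers)
```

For example, `looks_blocked(response.html.html)` could be checked before parsing, and the script could wait and retry when it returns True.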

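Alternatively, you can use SerpApi's Google Scholar API to scrape data from Google Scholar. You don't need to figure out how to bypass Google's blocking or how to render JavaScript pages. It is a paid API with a free plan.

Code to integrate: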

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_scholar",
  "q": "samsung",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(f"Title: {result['title']}")

Disclaimer: I work for SerpApi.
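Since the question asks for titles together with their abstracts, the result dictionary could be reduced to pairs along these lines. This is a sketch that assumes each entry in `organic_results` carries a `snippet` field next to `title`; the snippet is a short excerpt shown in the search results, not the full abstract, and the helper name `titles_and_snippets` is made up for illustration.

```python
def titles_and_snippets(results):
    """Return (title, snippet) pairs from a SerpApi-style result dict.

    Missing keys fall back to empty values instead of raising KeyError.
    """
    pairs = []
    for item in results.get("organic_results", []):
        pairs.append((item.get("title", ""), item.get("snippet", "")))
    return pairs


# Hand-written sample shaped like the assumed response, not real API output.
sample = {
    "organic_results": [
        {"title": "Paper A", "snippet": "Short excerpt of A"},
        {"title": "Paper B"},
    ]
}
print(titles_and_snippets(sample))
```

With the code above, `titles_and_snippets(search.get_dict())` would give the list for a live query.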

score 0 · Accepted Answer

Google Scholar is a JavaScript-enabled website, so scraping it with selenium would be a perfect solution. For more details, see here.

score 0 · Accepted Answer

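Google Scholar links to different websites such as sciencedirect, acm, etc. I have added selectors only for sciencedirect and acm; you can add more if needed. Google Scholar uses index-based pagination: for example, page 1 has start=0 and page 2 has start=10. The following script asks for a brand and the number of pages to scrape. It saves 2 files: a json and a csv.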
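The index-based pagination described above can be sketched on its own as follows. The helper name `page_urls` and the default page size of 10 are illustrative assumptions; the query parameters follow the URL used in this thread.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def page_urls(brand, pages, page_size=10):
    """Build one search URL per result page: start=0, 10, 20, ..."""
    urls = []
    for page in range(pages):
        query = urlencode({
            "start": page * page_size,   # offset of the first result on this page
            "q": f"{brand} technology",  # urlencode turns the space into '+'
            "hl": "en",
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


print(page_urls("samsung", 2))
```

Asking for 2 pages yields exactly two URLs, with start=0 and start=10.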

from bs4 import BeautifulSoup
import requests, time
import pandas as pd
import json

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

brand = input("Enter Technology:  ")
pages = int(input("Number of pages: "))
url = "https://scholar.google.com/scholar?start={}&q={}+technology&hl=en&as_sdt=0,5"

data = []
for i in range(0, pages*10, 10):  # start=0, 10, 20, ...: one offset per requested page
    print(url.format(i, brand))
    res = requests.get(url.format(i, brand),headers=headers)
    main_soup = BeautifulSoup(res.text, "html.parser")
    divs = main_soup.find_all("div", class_="gs_r gs_or gs_scl")
    for div in divs:
        temp = {}
        h3 = div.find("h3", class_="gs_rt")
        temp["Link"] = h3.find("a")["href"]
        temp["Heading"] = h3.find("a").get_text(strip=True)
        temp["Authors"] = div.find("div",class_="gs_a").get_text(strip=True)
        print(temp["Link"])
        try:
            res_link = requests.get(temp["Link"], headers=headers)
            soup_link = BeautifulSoup(res_link.text,"html.parser")
            if "sciencedirect" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstract author").find("div").get_text(strip=True)
            elif "acm" in temp["Link"]:
                temp["Abstract"] = soup_link.find("div", class_="abstractSection abstractInFull").get_text(strip=True)
        except Exception: pass  # skip links that fail to load or lack the expected markup
        data.append(temp)
        time.sleep(1)

with open("data.json", "w") as f:
    json.dump(data,f)

pd.DataFrame(data).to_csv("data.csv", index=False)

Output:

Link,Heading,Authors,Abstract
https://www.sciencedirect.com/science/article/pii/0149197096000078,Development of pyroprocessingtechnology,"JJ Laidler, JE Battles, WE Miller, JP Ackerman… - Progress in Nuclear …, 1997 - Elsevier","A compact, efficient method for recycling IFR fuel is being developed. This method, known as pyroprocessing, capitalizes on the use of metal fuel in the IFR and provides separation of actinide elements from fission products by means of an electrorefining step. The process of electrorefining is based on well-understood electrochemical concepts, the applications of which are described in this chapter. With only the addition of head-end processing steps, the pyroprocess can be applied with equal success to fuel types other than metal, enabling a symbiotic system wherein the IFR can be used to fission the actinide elements in spent nuclear fuel from other types of reactor."
https://www.sciencedirect.com/science/article/pii/S0041624X97001467,Acoustic wave sensors and theirtechnology,"MJ Vellekoop - Ultrasonics, 1998 - Elsevier","In the past two decades, acoustic-wave devices have gained enormous interest for sensor applications. The delay line device, where a transmitting and a receiving interdigital transducer are realized on a (piezoelectric) substrate is the most common structure used. The sensitive part is the surface between the two transducers. By placing the device in the feedback loop of an amplifier, an acoustic-wave oscillator is formed with properties such as inherent high sensitivity, high resolution, high stability and a frequency output signal which is easy to process.A very interesting development is the large amount of wave types now available for sensor applications. Sensors have been published using Rayleigh waves, Lamb waves, Love waves, acoustic plate modes, and surface transverse waves (STW). Each of these wave types have their special advantages and disadvantages with respect to sensitivity, stability, usability in liquids or gases, and fabrication complexity. For the fabrication of the acoustic-wave devices, planar technologies are used, which will be discussed in the paper. Examples will be given of gas sensors, biochemical sensors in liquids, viscosity and density sensing and high-voltage sensing. A comparison of the usability of the different wave types will be presented."
https://www.sciencedirect.com/science/article/pii/0167268188900558,Technologyand transaction cost economics: a reply,"OE Williamson - Journal of Economic Behavior & Organization, 1988 - Elsevier","I argue here, as I have previously, that technology is neither fully determinative of nor irrelevant to economic organization. Transaction cost economizing occupies a prominent position in any effort to assess the efficacy of alternative forms of economic organization."
https://www.sciencedirect.com/science/article/pii/0048733394900140,Learning by trying: the implementation of configurationaltechnology,"J Fleck- Research policy, 1994 - Elsevier","In this paper some issues concerning the nature of technological development are examined, with particular reference to a case study of the implementation of Computer Aided Production Management (CAPM). CAPM is an example of a configurational technology, built up to meet specific organizational requirements. It is argued that there is scope in the development of configurations for significant innovation to take place during implementation itself, through a distinctive form of learning by ‘struggling to get it to work’, or ‘learning by trying’. Some policy implications are outlined in conclusion: the need to recognize the creative opportunities available in this type of development, and the need to facilitate industrial sector-based learning processes."
...
...
...
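If you add selectors for more sites, matching on the hostname rather than on a substring of the whole link may be more robust, because a check such as "acm" in the link also matches unrelated URLs that happen to contain those letters. A minimal sketch follows; the class names are copied from the script above and may change when the sites are redesigned, and the helper name `abstract_class_for` is hypothetical.

```python
from urllib.parse import urlparse

# Domain -> CSS class of the abstract container (taken from the script above).
ABSTRACT_CLASSES = {
    "sciencedirect.com": "abstract author",
    "acm.org": "abstractSection abstractInFull",
}


def abstract_class_for(link):
    """Return the abstract container class for a known site, else None."""
    host = urlparse(link).hostname or ""
    for domain, css_class in ABSTRACT_CLASSES.items():
        # Match the domain itself or any subdomain such as www. or dl.
        if host == domain or host.endswith("." + domain):
            return css_class
    return None
```

The returned class can then be passed to `soup_link.find("div", class_=...)`. Note that the sciencedirect branch of the script additionally takes the first inner div.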
