2

我正在尝试获取网站(国家美术馆)上各个搜索结果的链接。但搜索链接不会加载搜索结果。这是我尝试这样做的方法:

url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

我可以看到可以在下面找到各个结果的链接,soup.findAll('a')但它们没有出现,而是最后一个输出是指向空搜索结果的链接: https ://www.nga.gov/content/ngaweb/collection-search-结果.html

我怎么能得到一个链接列表,第一个是第一个搜索结果(https://www.nga.gov/collection/art-object-page.52389.html),第二个是第二个搜索结果(https://www.nga.gov/collection/art-object-page.52085.html)等?

4

2 回答 2

1

实际上,数据是从 api 调用 json 响应生成的。这是所需的链接列表。

代码:

import requests
import json

url= 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)

for item in r.json()['results']:
    url = item['url']
    abs_url = f'https://www.nga.gov{url}'
    print(abs_url)

输出:

https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
于 2021-10-20T20:53:06.640 回答
0

这似乎对我有用:


from bs4 import BeautifulSoup
import requests
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a'):
    print(a['href'])

它返回所有 html a href 链接。

对于搜索结果中的链接,这些链接是通过 AJAX 加载的,您需要实现一些像无头 chrome 一样呈现 javascript 的东西。您可以在此处阅读有关实现此功能的一种方法,它非常适合您的用例。http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/

如果您想询问如何从 python 渲染 javascript 然后解析结果,您需要关闭此问题并打开一个新问题,因为它的范围不正确。

于 2021-10-20T20:39:26.733 回答