I'm new to data scraping, but I'm not asking this question carelessly without having searched around for a suitable answer.

I want to download the table on this page: https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje.

As you can see from the first screenshot, there are a couple of select/option controls above the table. The corresponding HTML code (on the right) shows that the second half of the year (2) and the year 2021 are selected. Re-selecting and re-submitting the form changes the contents of the table, but the URL stays the same. The change is, however, reflected in the HTML code. See the second screenshot below, where the options have been changed to 1 and 2018.

Based on these inspections, I put together a Python script (using bs4 and requests_html) to fetch the initial page, modify the select/option tags, and then post them back to the URL. See the code below. However, it fails at its task: the web page does not respond to the modifications. Can anyone explain why?

Thanks in advance,
Liang
from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin

url = "https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje#"

# Initialize an HTTP session
session = HTMLSession()

# GET the initial page
res = session.get(url)
# for a JavaScript-driven website:
# res.html.render()
soup = BeautifulSoup(res.html.html, "html.parser")

# Get all select tags
selects = soup.find_all("select")

# Modify the select tags: select the first half of the year instead
selects[0].contents[1].attrs["selected"] = ""
del selects[0].contents[3].attrs["selected"]

# Put the modified tags into a dictionary
data = {}
data[selects[0]["name"]] = selects[0]
data[selects[1]["name"]] = selects[1]

# POST it back to the website
res = session.post(url, data=data)

# Remake the soup after the modification
soup = BeautifulSoup(res.content, "html.parser")

# The code below only replaces relative URLs with absolute ones
for link in soup.find_all("link"):
    try:
        link.attrs["href"] = urljoin(url, link.attrs["href"])
    except KeyError:
        pass
for script in soup.find_all("script"):
    try:
        script.attrs["src"] = urljoin(url, script.attrs["src"])
    except KeyError:
        pass
for img in soup.find_all("img"):
    try:
        img.attrs["src"] = urljoin(url, img.attrs["src"])
    except KeyError:
        pass
for a in soup.find_all("a"):
    try:
        a.attrs["href"] = urljoin(url, a.attrs["href"])
    except KeyError:
        pass

# Write the page content to a file
with open("page.html", "w") as f:
    f.write(str(soup))
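For context on what I believe a browser actually sends: as far as I understand, a form submission carries plain name→value pairs (the value attribute of the chosen option), not the HTML of the tags themselves. Below is a minimal sketch of building such a payload from parsed HTML. The snippet and the field names semestre and ano are hypothetical stand-ins, not taken from the real page:

from bs4 import BeautifulSoup

# Hypothetical form markup standing in for the real page's selects
html = """
<form>
  <select name="semestre">
    <option value="1">1</option>
    <option value="2" selected>2</option>
  </select>
  <select name="ano">
    <option value="2018">2018</option>
    <option value="2021" selected>2021</option>
  </select>
</form>
"""

soup = BeautifulSoup(html, "html.parser")

# Build a name -> value payload: for each select, submit the value of the
# option we want (here, the first option of each select: 1 and 2018)
payload = {
    sel["name"]: sel.find_all("option")[0]["value"]
    for sel in soup.find_all("select")
}
print(payload)  # {'semestre': '1', 'ano': '2018'}

Such a payload would then go into session.post(url, data=payload); the real field names would of course have to be read from the actual page's HTML.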