I'm new to data scraping, but I'm not asking this question carelessly without having searched around for a suitable answer.

I want to download the table on this page: https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje.

As you can see from the first screenshot, there are a couple of select/option controls above the table. The corresponding HTML code (on the right) shows that the second half of the year (2) and the year 2021 are selected. Re-selecting and re-submitting the form changes the contents of the table, but the URL stays the same. The change is, however, reflected in the HTML code. See the second screenshot below, where the options have been changed to 1 and 2018.

Based on these inspections, I put together a Python script (using bs4 and requests_html) to fetch the initial page, modify the select/option tags, and then post them back to the URL. See the code below. However, it fails at its task: the web page does not respond to the modifications. Can anyone explain why?

Thanks in advance,
Liang
from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin

url = "https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje#"

# Initialize an HTTP session
session = HTMLSession()

# GET the initial page
res = session.get(url)
# for a JavaScript-driven website:
# res.html.render()
soup = BeautifulSoup(res.html.html, "html.parser")

# Get all select tags
selects = soup.find_all("select")

# Modify the select tags: select the first half of the year instead
selects[0].contents[1].attrs["selected"] = ""
del selects[0].contents[3].attrs["selected"]

# Put the modified tags into a dictionary
data = {}
data[selects[0]["name"]] = selects[0]
data[selects[1]["name"]] = selects[1]

# POST it back to the website
res = session.post(url, data=data)

# Remake the soup after the modification
soup = BeautifulSoup(res.content, "html.parser")

# The code below only replaces relative URLs with absolute ones
for link in soup.find_all("link"):
    try:
        link.attrs["href"] = urljoin(url, link.attrs["href"])
    except KeyError:
        pass
for script in soup.find_all("script"):
    try:
        script.attrs["src"] = urljoin(url, script.attrs["src"])
    except KeyError:
        pass
for img in soup.find_all("img"):
    try:
        img.attrs["src"] = urljoin(url, img.attrs["src"])
    except KeyError:
        pass
for a in soup.find_all("a"):
    try:
        a.attrs["href"] = urljoin(url, a.attrs["href"])
    except KeyError:
        pass

# Write the page content to a file
with open("page.html", "w") as f:
    f.write(str(soup))
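For context on what I believe a browser actually sends: as far as I understand, a form submission carries plain name→value pairs (the value attribute of the chosen option), not the HTML of the tags themselves. Below is a minimal sketch of building such a payload from parsed HTML. The snippet and the field names semestre and ano are hypothetical stand-ins, not taken from the real page:

from bs4 import BeautifulSoup

# Hypothetical form markup standing in for the real page's selects
html = """
<form>
  <select name="semestre">
    <option value="1">1</option>
    <option value="2" selected>2</option>
  </select>
  <select name="ano">
    <option value="2018">2018</option>
    <option value="2021" selected>2021</option>
  </select>
</form>
"""

soup = BeautifulSoup(html, "html.parser")

# Build a name -> value payload: for each select, submit the value of the
# option we want (here, the first option of each select: 1 and 2018)
payload = {
    sel["name"]: sel.find_all("option")[0]["value"]
    for sel in soup.find_all("select")
}
print(payload)  # {'semestre': '1', 'ano': '2018'}

Such a payload would then go into session.post(url, data=payload); the real field names would of course have to be read from the actual page's HTML.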