python-3.x - How to build a pagination loop that scrapes a variable number of pages (the pages change daily)

Question

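Summary

I'm working on my Supply Chain Management college project and want to analyze daily postings on a website to analyze and document industry demand for services/products. The particular page changes every day and has a different number of containers and pages each time: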

https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today

Background

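The code generates a csv file (don't mind the headers) by scraping HTML tags and recording the data points. I tried to use a 'for' loop, but the code still scans only the first page.

Python knowledge level: beginner, learning "the hard way" through YouTube and googling. I found examples that suit my level of understanding, but I have trouble combining people's different solutions.

Code at the moment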

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

The problem starts here

for page in range (1,3):
    my_url = 'https://buyandsell.gc.ca/procurement-data/search/site?f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"rc"})

This part does not write in addition to the existing line items

filename = "BuyandSell.csv"
f = open(filename, "w")
headers = "Title, Publication Date, Closing Date, GSIN, Notice Type, Procurement Entity\n"
f.write(headers)

for container in containers:
    Title = container.h2.text

    publication_container = container.findAll("dd",{"class":"data publication-date"})
    Publication_date = publication_container[0].text

    closing_container = container.findAll("dd",{"class":"data date-closing"})
    Closing_date = closing_container[0].text

    gsin_container = container.findAll("li",{"class":"first"})
    Gsin = gsin_container[0].text

    notice_container = container.findAll("dd",{"class":"data php"})
    Notice_type = notice_container[0].text

    entity_container = container.findAll("dd",{"class":"data procurement-entity"})
    Entity = entity_container[0].text

    print("Title: " + Title)
    print("Publication_date: " + Publication_date)
    print("Closing_date: " + Closing_date)
    print("Gsin: " + Gsin)
    print("Notice: " + Notice_type)
    print("Entity: " + Entity)

    f.write(Title + "," +Publication_date + "," +Closing_date + "," +Gsin + "," +Notice_type + "," +Entity +"\n")

f.close()

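Please let me know if you would like more detail. The rest of the code defines the data containers that are found in the HTML and printed to the csv. Any help or advice would be highly appreciated. Thanks!

Actual results:

The code generates a CSV file for the first page only.

The code does not add to what was already scanned on previous days; the file is overwritten on each run.

Expected results:

The code scans the next page and recognizes when there are no more pages to go through.

The CSV file gains 10 rows per page (or however many are on the last page, since the number is not always 10).

The code appends to what has already been scraped (for more advanced analysis using Excel tools with historical data).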

score 0 · Accepted Answer

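Some might say using pandas is overkill, but I'm personally comfortable with it and like using it to build tables and write them to file.

There is probably a more robust way to go from page to page, but I wanted to get this to you so you can work with it.

For now, I have just hard-coded the next-page value (I arbitrarily picked 20 pages as the maximum). So it starts at page 1 and goes through 20 pages, or stops as soon as it reaches an invalid page.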
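
On the more robust way to go page to page: one option is to build the URL from its parameters and keep requesting pages until one comes back empty, with a page cap only as a safety net. The following is a minimal sketch under stated assumptions, not a tested scraper: `build_url`, `scrape_all_pages` and the `fetch_rows` callable are hypothetical names, and `fetch_rows` stands in for whatever downloads a page and returns its parsed rows. The `page` parameter is assumed to be zero-based, as in the script further down.

```python
from urllib.parse import urlencode

BASE_URL = "https://buyandsell.gc.ca/procurement-data/search/site"


def build_url(page):
    """Build the search URL for a zero-based page index (assumed scheme)."""
    params = [
        ("page", page),
        ("f[0]", "sm_facet_procurement_data:data_data_tender_notice"),
        ("f[1]", "dds_facet_date_published:dds_facet_date_published_today"),
    ]
    # urlencode percent-escapes the brackets and colons in the keys/values
    return BASE_URL + "?" + urlencode(params)


def scrape_all_pages(fetch_rows, max_pages=100):
    """Collect rows page by page until a page comes back empty.

    fetch_rows is a hypothetical callable: it takes a URL and returns
    a list of rows (an empty list once there are no more results).
    max_pages is only a safety cap against an endless loop.
    """
    all_rows = []
    for page in range(max_pages):
        rows = fetch_rows(build_url(page))
        if not rows:
            break  # first empty page: nothing left to scrape
        all_rows.extend(rows)
    return all_rows
```

Passing the fetching function in as an argument keeps the loop testable without hitting the site. The script with the hard-coded limit of 20 pages follows.
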

import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

filename = "BuyandSell.csv"

# Initialize an empty 'results' dataframe
results = pd.DataFrame()

# Iterate through the pages
for page in range(0,20):
    url = 'https://buyandsell.gc.ca/procurement-data/search/site?page=' + str(page) + '&f%5B0%5D=sm_facet_procurement_data%3Adata_data_tender_notice&f%5B1%5D=dds_facet_date_published%3Adds_facet_date_published_today'

    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    containers = page_soup.findAll("div",{"class":"rc"})

    # Get data from each container
    if containers != []:
        for each in containers:
            title = each.find('h2').text.strip()
            publication_date = each.find('dd', {'class':'data publication-date'}).text.strip()
            closing_date = each.find('dd', {'class':'data date-closing'}).text.strip()
            gsin = each.find('dd', {'class':'data gsin'}).text.strip()
            notice_type = each.find('dd', {'class':'data php'}).text.strip()
            procurement_entity = each.find('dd', {'class':'data procurement-entity'}).text.strip()

            # Create 1 row dataframe
            temp_df = pd.DataFrame([[title, publication_date, closing_date, gsin, notice_type, procurement_entity]], columns = ['Title', 'Publication Date', 'Closing Date', 'GSIN', 'Notice Type', 'Procurement Entity'])

            # Append that row to the 'results' dataframe
            # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
            results = pd.concat([results, temp_df], ignore_index=True)
        print ('Acquired page ' + str(page+1))

    else:
        print ('No more pages')
        break


# If a file has already been saved
if os.path.isfile(filename):

    # Read in previously saved file
    df = pd.read_csv(filename)

    # Append the newest results
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    df = pd.concat([df, results], ignore_index=True)

    # Drop any duplicates (in case the newest results aren't really new)
    df = df.drop_duplicates()

    # Save the previous file, with appended results
    df.to_csv(filename, index=False)

else:

    # If no previous file has been saved, save a new one
    df = results.copy()
    df.to_csv(filename, index=False)
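
If pandas feels like overkill for the append-and-deduplicate step, the standard-library `csv` module can do the same job. This is a minimal sketch, assuming rows are lists of strings in the column order used above; `append_new_rows` and `HEADER` are hypothetical names, not part of any library. Unlike joining fields with `","` by hand, `csv.writer` quotes fields that themselves contain commas, which a title may well do.

```python
import csv
import os

HEADER = ["Title", "Publication Date", "Closing Date",
          "GSIN", "Notice Type", "Procurement Entity"]


def append_new_rows(path, rows, header=HEADER):
    """Append rows not already present in the CSV at path.

    Returns the number of rows actually written.
    """
    file_exists = os.path.isfile(path)
    seen = set()
    if file_exists:
        # Load previously saved rows so duplicates can be skipped
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header line
            seen = {tuple(r) for r in reader}

    added = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if not file_exists:
            writer.writerow(header)  # new file: write the header first
        for row in rows:
            key = tuple(row)
            if key not in seen:
                writer.writerow(row)
                seen.add(key)
                added += 1
    return added
```

Calling `append_new_rows("BuyandSell.csv", rows)` once a day would add only rows that are not in the file yet, so the file accumulates history for later analysis in Excel.
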
