
I need to download all the files from this page:

https://www.dmo.gov.uk/publications/?offset=0&itemsPerPage=1000000&parentFilter=1433&childFilter=1433%7C1450&startMonth=1&startYear=2008&endMonth=6&endYear=2021

specifically the ones with "Auction" in the title. Here is the source for one of the files, for example:

<a href="/media/17527/pr090621b.pdf" aria-label="Auction of £2,500 million  of 0 5/8% Treasury Gilt 2035, published 09 June 2021">Auction of £2,500 million  of 0 5/8% Treasury Gilt 2035</a>

I'm trying to adapt some code I found in another question, but the page comes back empty:

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, url, destination_path = task
    response = session.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    
    host = "https://www.dmo.gov.uk"  # site root for the relative links
    url = host + "/publications/?offset=0&itemsPerPage=1000000&parentFilter=1433&childFilter=1433%7C1450&startMonth=1&startYear=2008&endMonth=6&endYear=2021"
    destination_path = "pgns"
    max_workers = 8

    if not os.path.exists(destination_path):
        os.makedirs(destination_path)
    
    with requests.Session() as session:
        response = session.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*Auction of\?.*"))
        tasks = [
            (session, host + page.get("href"), destination_path) 
            for page in pages
        ]

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)

3 Answers


Check your regex syntax. In a regex, \? escapes the question mark, so the pattern r".*Auction of\?.*" only matches text that literally contains "of?".

But the href= argument is matched against the URL in the link, not its text, so it wouldn't have helped you much anyway. This will find the links whose title matches:

links = soup.find_all("a", string=re.compile(r"Auction of\b"))

And this will extract their URLs so you can retrieve them:

[ file["href"] for file in links ]
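To see concretely why the original pattern fails: \? matches a literal question mark, while \b is a zero-width word boundary. A quick stdlib check against the label text from the question:

```python
import re

label = "Auction of £2,500 million  of 0 5/8% Treasury Gilt 2035"

# \? demands a literal "?" right after "of" -- the label has a space there
print(bool(re.search(r"Auction of\?", label)))  # False

# \b matches the word boundary after "of", so the corrected pattern matches
print(bool(re.search(r"Auction of\b", label)))  # True
```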
answered 2021-06-16T13:15:52.420

The find_all() method also accepts a function. You can pass a lambda that filters for every a tag whose text contains "Auction of":

for tag in soup.find_all(lambda t: t.name == "a" and "Auction of" in t.text):
    print(tag.text)

Alternatively, you can use the [attribute*=value] CSS selector:

# Find all `aria-label` attributes under an `a` that contain `Auction of`
for tag in soup.select("a[aria-label*='Auction of']"):
    print(tag.text)
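The same aria-label substring filter can be sanity-checked with nothing but the standard library's html.parser, using the anchor quoted in the question (the second, non-matching anchor is made up for contrast). A minimal sketch:

```python
from html.parser import HTMLParser

# First anchor is the one quoted in the question; second is a dummy for contrast
SNIPPET = ('<a href="/media/17527/pr090621b.pdf" aria-label="Auction of '
           '£2,500 million  of 0 5/8% Treasury Gilt 2035, published 09 June 2021">'
           'Auction of £2,500 million  of 0 5/8% Treasury Gilt 2035</a>'
           '<a href="/media/1/other.pdf" aria-label="Tender results">Tender</a>')

class AuctionLinks(HTMLParser):
    """Collects the href of every <a> whose aria-label contains 'Auction of'."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Mirrors the CSS selector a[aria-label*='Auction of']
        if tag == "a" and "Auction of" in attrs.get("aria-label", ""):
            self.hrefs.append(attrs["href"])

parser = AuctionLinks()
parser.feed(SNIPPET)
print(parser.hrefs)  # ['/media/17527/pr090621b.pdf']
```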
answered 2021-06-16T16:05:52.573

This is what ended up working for me:

from bs4 import BeautifulSoup
import requests
import re

host = 'https://www.dmo.gov.uk'  # no trailing slash: the hrefs already start with "/"
url = 'https://www.dmo.gov.uk/publications/?offset=0&itemsPerPage=1000000000&parentFilter=1433&childFilter=1433|1450&startMonth=1&startYear=2000&endMonth=6&endYear=2021'

# Collect the href of every link whose aria-label starts with "Auction of"
links = []
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
for a in soup.find_all("a", {"aria-label": re.compile(r"^Auction of\b")}, href=True):
    links.append(a['href'])

def download_file(url):
    # Local filename = last path segment of the URL, minus any query string
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)

for link in links:
    download_file(host + link)
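The filename derivation inside download_file can be checked in isolation, here against the sample link from the question:

```python
# Sample URL built from the host and the href quoted in the question
u = "https://www.dmo.gov.uk/media/17527/pr090621b.pdf"

# Last path segment, with any query string stripped
path = u.split('/')[-1].split('?')[0]
print(path)  # pr090621b.pdf
```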
answered 2021-06-16T13:58:52.533