python-3.x - Web scraping - non-href links

Question

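I have a list of websites in a csv, and I want to capture all the PDFs on them.

BeautifulSoup's select works fine for <a href>, but one website puts its PDF links in <data-url="https://example.org/abc/qwe.pdf">, and soup cannot capture anything there.

Is there any code I can use to get everything that starts with "data-url" and ends with .pdf?

I apologize for the messy code. I'm still learning. Please let me know if I can clarify anything.

Thanks :D

The csv looks like this: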

123456789 https://example.com

234567891 https://example2.com

import os
import requests
import csv
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Read the csv rows into a list of (id, url) tuples
with open('links.csv') as f:
    url = [tuple(line) for line in csv.reader(f)]
print(url)

# Create the output folder if it does not exist yet
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def url_response(prefix, url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for i, link in enumerate(soup.select("a[href$='.pdf']")):
        # Name each file <prefix>_<index>.pdf inside the output folder
        filename = os.path.join(folder_location, f"{prefix}_{i}.pdf")
        print(filename)
        # Download the PDF and write it to that path
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)

# Loop over the csv rows
for a, b in url:
    url_response(a, b)

score 0 · Accepted Answer

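Yes: use an attribute = value selector with the $ "ends with" operator. It is the same kind of selector as your existing href one, just applied to a different attribute.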

soup.select('[data-url$=".pdf"]')

Combined with the "or" (comma) syntax:

soup.select('[href$=".pdf"],[data-url$=".pdf"]')

You can then test each retrieved element with has_attr to decide which attribute to read from it.
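As a minimal sketch of how the combined selector and has_attr might fit together: the helper name `pdf_links`, the sample HTML, and the base URL below are assumptions made for illustration and do not come from the question.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def pdf_links(html, base_url):
    """Return absolute PDF URLs found in href or data-url attributes."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.select('[href$=".pdf"],[data-url$=".pdf"]'):
        # A matched tag may carry either attribute (or both), so check each one.
        for attr in ("href", "data-url"):
            if tag.has_attr(attr) and tag[attr].endswith(".pdf"):
                links.append(urljoin(base_url, tag[attr]))
    return links

# Hypothetical sample page used only for this illustration
sample = """
<a href="/docs/one.pdf">one</a>
<div data-url="https://example.org/abc/qwe.pdf"></div>
<a href="/page.html">not a pdf</a>
"""
print(pdf_links(sample, "https://example.com"))
```

urljoin leaves the absolute data-url value unchanged and resolves the relative href against the base URL, so both kinds of link can go through the same download step.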

score 0 · Accepted Answer

If BeautifulSoup doesn't help you, here is a regex solution for finding the links:

Sample HTML:

txt = """
        <html>
        <body>
        <p>
        <data-url="https://example.org/abc/qwe.pdf">
        </p>
        <p>
        <data-url="https://example.org/def/qwe.pdf">
        </p>
        </body>
        </html>
        """

Regex code to extract the links inside data-url:

import re

re1 = '(<data-url=")' ## STARTS WITH
re2 = '((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))' # HTTP URL
re3 = '(">)' ## ENDS WITH

rg= re.compile(re1 + re2 + re3 ,re.IGNORECASE|re.DOTALL)
links = re.findall(rg, txt)

for i in range(len(links)):
    print(links[i][1])

Output:

https://example.org/abc/qwe.pdf
https://example.org/def/qwe.pdf
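The pattern above matches any HTTP URL inside the tag. If only values ending in .pdf are wanted, as the question asks, a narrower pattern may be enough. This is a sketch under the assumption that the attribute is always written as data-url="..." with double quotes; the names `DATA_URL_PDF` and `find_data_url_pdfs` and the sample string are hypothetical.

```python
import re

# Capture the quoted value of data-url only when it ends in .pdf (case-insensitive).
DATA_URL_PDF = re.compile(r'data-url="([^"]+\.pdf)"', re.IGNORECASE)

def find_data_url_pdfs(text):
    """Return every data-url value in text that ends with .pdf."""
    return DATA_URL_PDF.findall(text)

# Hypothetical sample: one PDF link and one non-PDF link
sample = (
    '<p><data-url="https://example.org/abc/qwe.pdf"></p>'
    '<p data-url="https://example.org/page.html"></p>'
)
print(find_data_url_pdfs(sample))
```

Because the pattern has a single capture group, findall returns the URLs directly as strings rather than as tuples.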
