I have a list of websites in a csv, and I want to grab every PDF on them.
BeautifulSoup's select works fine for <a href> links,
but one site puts its PDF links in a data-url attribute, like data-url="https://example.org/abc/qwe.pdf",
and soup doesn't pick up any of those.
Is there any code that would grab everything that starts with "data-url" and ends with .pdf?
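From reading about CSS selectors I'm guessing an attribute selector might do it, something like this minimal sketch (untested; I'm assuming BeautifulSoup's select supports the $= suffix match on an arbitrary attribute like data-url, the same way it does for href):

# Untested sketch: select every tag whose data-url attribute ends in .pdf,
# analogous to the a[href$='.pdf'] selector that already works below.
for tag in soup.select("[data-url$='.pdf']"):
    print(tag["data-url"])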
Sorry for the messy code, I'm still learning. Let me know if I can clarify anything.
Thanks :D
The csv looks like this:
123456789 https://example.com
234567891 https://example2.com
import csv
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Read the csv into (id, url) tuples.
# csv.reader assumes comma-separated fields; pass delimiter=' ' if yours are space-separated.
with open('links.csv') as f:
    rows = [tuple(line) for line in csv.reader(f)]
print(rows)

# If the folder does not exist, create it.
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def url_response(site_id, url):
    # Download every linked .pdf on the page, saving as <site_id>_<n>.pdf.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for i, link in enumerate(soup.select("a[href$='.pdf']")):
        # Build the local filename from the csv id and a counter up front,
        # so every download lands in folder_location with its final name
        # and no separate rename step is needed.
        filename = os.path.join(folder_location, f"{site_id}_{i}.pdf")
        print(filename)
        # Resolve relative hrefs against the page URL, then save the file.
        with open(filename, 'wb') as out:
            out.write(requests.get(urljoin(url, link['href'])).content)

# Loop over the csv rows; pass the id in instead of relying on globals.
for site_id, site_url in rows:
    url_response(site_id, site_url)
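And in case it helps explain what I'm after, this is roughly how I imagined combining both cases inside url_response (again just an untested sketch; pdf_urls is my own name, and site_id, url, soup and folder_location are the variables from the function above):

# Untested sketch: gather pdf URLs from both href and data-url attributes,
# then reuse the same download loop for all of them.
pdf_urls = [a['href'] for a in soup.select("a[href$='.pdf']")]
pdf_urls += [tag['data-url'] for tag in soup.select("[data-url$='.pdf']")]
for i, pdf_url in enumerate(pdf_urls):
    filename = os.path.join(folder_location, f"{site_id}_{i}.pdf")
    with open(filename, 'wb') as out:
        out.write(requests.get(urljoin(url, pdf_url)).content)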