Thank you for reading my post. I have a list of URLs that point to PDF files.
for eachurl in url_list:
    print(eachurl)
Here are the links to my PDFs:
https://www.sec.gov/Archives/edgar/data/1005757/999999999715000035/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1037760/999999999715000162/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1038133/999999999715000169/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1009626/999999999715000483/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1017491/999999999715000518/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000557/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000795/filename1.pdf
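If I click these seven links and download the PDFs manually, they all work perfectly. However, if I download them with Python code, errors occur at random. Sometimes the first PDF is corrupted and cannot be opened; sometimes it is the second, or the third, and so on.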
from pathlib import Path
import requests

n_files = 0
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169'}
for eachurl in url_list:
    n_files += 1
    response = requests.get(eachurl, headers=headers)
    # Save each response body as 1.pdf, 2.pdf, ... without checking what came back
    filename = Path(str(n_files) + '.pdf')
    filename.write_bytes(response.content)
Can you help me understand why this happens?
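Update: I uploaded the downloaded files to Google Drive and finally found out that this happens because the SEC identifies me as a bot. I have already added headers to the request. Any idea how to get around this?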
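In case it helps narrow things down, here is a minimal sketch of how the bad downloads could be detected before they are saved. It relies on the fact that a valid PDF starts with the bytes `%PDF-`, whereas a bot-block page would normally be HTML. The helper names `looks_like_pdf` and `save_if_pdf` are hypothetical (they are not part of my original code), and the check does not fix the blocking itself.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # signature found at the start of a valid PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    return data.startswith(PDF_MAGIC)


def save_if_pdf(data: bytes, target: Path) -> bool:
    """Write data to target only when it looks like a PDF; report success."""
    if not looks_like_pdf(data):
        return False  # probably an HTML block page, so nothing is written
    target.write_bytes(data)
    return True
```

In the download loop, `save_if_pdf(response.content, filename)` could replace `filename.write_bytes(response.content)`, so that a `False` result flags which URLs were answered with something other than a PDF.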