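The data is loaded dynamically with JavaScript. There is plenty of information about scraping this kind of page (see here for one of many examples); in this case, the following should help you: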
import requests
import json
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0',
'Accept': 'application/json, text/javascript, */*; q=0.01',
}
data = '{"q":"ex10","category":"custom","forms":["10-K","10-Q","8-K"],"startdt":"2020-10-08","enddt":"2021-10-08"}'
# Obviously, you need to change "startdt" and "enddt" as necessary
response = requests.post('https://efts.sec.gov/LATEST/search-index', headers=headers, data=data)
The response is in JSON format. Your URLs are hidden in there:
data = json.loads(response.text)
hits = data['hits']['hits']
for hit in hits:
    cik = hit['_source']['ciks'][0]
    file_data = hit['_id'].split(":")
    filing = file_data[0].replace('-','')
    file_name = file_data[1]
    url = f'https://www.sec.gov/Archives/edgar/data/{cik}/{filing}/{file_name}'
    print(url)
Output:
https://www.sec.gov/Archives/edgar/data/0001372183/000158069520000415/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001372183/000138713120009670/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001540615/000154061520000006/ex10.htm
https://www.sec.gov/Archives/edgar/data/0001552189/000165495421004948/ex10-1.htm
and so on.
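
If you want to reuse this for other date ranges, the two steps above (building the request body and turning a hit into a URL) can be pulled into small functions. This is only a sketch: `build_payload` and `hit_to_url` are names I made up, and `sample_hit` is reconstructed by hand from the first output line rather than copied from a real API response, so treat the exact shape of `_id` as an assumption.

```python
import json
from datetime import date


def build_payload(start: date, end: date, query: str = "ex10") -> str:
    """Return the JSON request body used above, with the date range filled in."""
    return json.dumps({
        "q": query,
        "category": "custom",
        "forms": ["10-K", "10-Q", "8-K"],
        "startdt": start.isoformat(),
        "enddt": end.isoformat(),
    })


def hit_to_url(hit: dict) -> str:
    """Build the Archives URL for one search hit, the same way the loop above does."""
    cik = hit["_source"]["ciks"][0]
    # "_id" is assumed to look like "<accession-number>:<file-name>"
    accession, file_name = hit["_id"].split(":", 1)
    filing = accession.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{filing}/{file_name}"


# Hand-built sample (an assumption, not real API output), so this runs offline.
sample_hit = {
    "_id": "0001580695-20-000415:ex10-5.htm",
    "_source": {"ciks": ["0001372183"]},
}
print(build_payload(date(2020, 10, 8), date(2021, 10, 8)))
print(hit_to_url(sample_hit))
```

You would then pass `build_payload(...)` as `data=` in the `requests.post` call and call `hit_to_url(hit)` inside the loop.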