
I'm new to web scraping in Python and am trying to scrape all of the .htm document links from the SEC EDGAR full-text search. I can see the links in the modal footer, but BeautifulSoup does not find any elements with those hrefs.

Is there a simple way to parse out the links to the documents?

[Snapshot of the links in the site's HTML source]

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/edgar/search/#/q=ex10&category=custom&forms=10-K%252C10-Q%252C8-K'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

for a in soup.find_all(id="open-file"):
    print(a)

1 Answer


That data is loaded dynamically with JavaScript. There is a lot of information about scraping pages like this (see one of the many existing examples); in this case, the following should get you there:

import requests
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',   
}

data = '{"q":"ex10","category":"custom","forms":["10-K","10-Q","8-K"],"startdt":"2020-10-08","enddt":"2021-10-08"}'
# obviously, change "startdt" and "enddt" as necessary
response = requests.post('https://efts.sec.gov/LATEST/search-index', headers=headers, data=data)
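As a small aside (my own preference, not part of the original answer): rather than hand-writing the JSON string, you can build the payload as a Python dict and serialize it with `json.dumps`, which avoids quoting mistakes and makes the date fields easy to change programmatically.

```python
import json

# Same search parameters as the hand-written string above,
# built as a dict and serialized to JSON.
payload = {
    "q": "ex10",
    "category": "custom",
    "forms": ["10-K", "10-Q", "8-K"],
    "startdt": "2020-10-08",
    "enddt": "2021-10-08",
}
data = json.dumps(payload)
print(data)
```

The resulting `data` string can be passed to `requests.post` exactly as before.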

The response is in JSON format. Your urls are hidden in there:

data = json.loads(response.text)
hits = data['hits']['hits']
for hit in hits:
    cik = hit['_source']['ciks'][0]
    # each _id has the form "accession-number:file-name"
    file_data = hit['_id'].split(":")
    filing = file_data[0].replace('-','')   # accession number without dashes
    file_name = file_data[1]
    url = f'https://www.sec.gov/Archives/edgar/data/{cik}/{filing}/{file_name}'
    print(url)

Output:

https://www.sec.gov/Archives/edgar/data/0001372183/000158069520000415/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001372183/000138713120009670/ex10-5.htm
https://www.sec.gov/Archives/edgar/data/0001540615/000154061520000006/ex10.htm
https://www.sec.gov/Archives/edgar/data/0001552189/000165495421004948/ex10-1.htm

etc.
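If you then want to save each document locally, one option (a sketch of my own, assuming the URLs follow the `Archives/edgar/data/{cik}/{filing}/{file_name}` layout shown above) is to derive a unique local filename from each URL before fetching it, so files from different filings don't overwrite each other:

```python
# Turn each result URL into a unique local filename of the form
# cik_accession_filename, by splitting off the last three path segments.
urls = [
    "https://www.sec.gov/Archives/edgar/data/0001372183/000158069520000415/ex10-5.htm",
    "https://www.sec.gov/Archives/edgar/data/0001540615/000154061520000006/ex10.htm",
]
local_names = []
for url in urls:
    cik, accession, file_name = url.rsplit("/", 3)[1:]
    local_names.append(f"{cik}_{accession}_{file_name}")
print(local_names)
```

You could then download each document with `requests.get(url, headers=headers)` (re-using the headers from above, since sec.gov expects a User-Agent) and write `response.text` to the corresponding local name.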

Answered 2021-10-08T14:47:17.487