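I'm trying to extract information from the latest SEC EDGAR Schedule 13 filings.
Taking this filing link as an example:
The information I'm trying to extract (and the part of the filing that contains it):
1) Name of reporting person: Saba Capital Management, LP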
<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>
2) Name of issuer: WESTERN ASSET HIGH INCOME FUND II INC
<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>
3) CUSIP number: 95766J102 (managed to get this)
<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>
4) Percent of class represented by amount: 11.3% (managed to get this)
<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>
5) Date of event which requires filing of this statement: December 24, 2019
<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>
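Looking at the snippets above, there seem to be two layouts: on the cover page the value sits on the line above a parenthesized label (issuer, CUSIP, event date), while in the numbered rows the value sits on the line below the label (reporting person, percent of class). Here is a minimal sketch of that idea, run against the quoted snippets only and not against the full filing. The names `SAMPLE`, `to_lines`, `value_before` and `values_after` are my own hypothetical helpers, and I'm assuming the labels appear with the wording shown above.

```python
import html
import re

# Inline copy of the snippets quoted above, standing in for the downloaded filing.
SAMPLE = """
<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>
<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>
<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>
<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>
<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>
"""


def to_lines(markup):
    """Turn markup into a list of non-empty, stripped text lines."""
    # Treat <br> and <p> boundaries as line breaks, then drop every other tag.
    text = re.sub(r"<br\s*/?>|</?p\b[^>]*>", "\n", markup, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    return [ln.strip() for ln in html.unescape(text).splitlines() if ln.strip()]


def value_before(lines, label):
    """First value printed on the line directly above a line containing `label`."""
    for i, line in enumerate(lines):
        if i > 0 and label.lower() in line.lower():
            return lines[i - 1]
    return None


def values_after(lines, label):
    """Every value printed on the line directly below a line starting with `label`."""
    return [
        lines[i + 1]
        for i, line in enumerate(lines[:-1])
        if line.lower().startswith(label.lower())
    ]


lines = to_lines(SAMPLE)
result = {
    "issuer": value_before(lines, "(Name of Issuer)"),
    "cusip": value_before(lines, "(CUSIP Number)"),
    "event_date": value_before(lines, "(Date of Event"),
    "reporting_persons": values_after(lines, "NAME OF REPORTING PERSON"),
    "percent_of_class": values_after(lines, "PERCENT OF CLASS"),
}
print(result)
```

`values_after` returns a list because a filing can carry one cover page per reporting person, which may be why the snippet shows "Saba Capital Management GP, LLC" while I expected "Saba Capital Management, LP". I have not checked whether other filers use the same label wording or line layout.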
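import re
import requests
from bs4 import BeautifulSoup
# SEC.gov expects automated requests to declare a User-Agent; this value is a placeholder
headers = {'User-Agent': 'Sample Name sample@example.com'}
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm', headers=headers)
# the filing is HTML, so parse it with an HTML parser instead of 'xml'
soup = BeautifulSoup(page.text, 'html.parser')
## get CUSIP number
CUSIP = re.findall(r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]', soup.text)
### get %
# the lookbehind anchors on the row label; the escaped dot matches only a literal decimal point
regex = r"(?<=PERCENT OF CLASS|Percent of class)(.*)(?=%)"
percent = re.findall(r'\d+\.\d+', re.search(regex, soup.text, re.DOTALL).group().split('%')[0])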
How can I extract these 5 pieces of information from the filing? Thanks in advance.