python - REGEX 从 EDGAR SC-13 表格中提取信息

Question

我正在尝试从最新的SEC EDGAR 附表 13 表格文件中提取信息。

以备案链接为例：

1)萨巴资本_2019 年 12 月 27 日_SC13

我试图提取的信息（以及包含该信息的文件的部分）

1) 申报人姓名：Saba Capital Management, LP

<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>

2) 发行人名称：WESTERN ASSET HIGH INCOME FUND II INC

<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>

3) CUSIP号码：95766J102（设法得到）

<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>

4) 以金额表示的班级百分比：11.3%（设法获得）

<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>

5) 需要提交本声明的事件日期：2019 年 12 月 24 日

<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>

.

import requests 
import re
from bs4 import BeautifulSoup

page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'xml')

## get CUSIP number
CUSIP = re.findall(r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]', soup.text)

### get % 
regex = r"(?<=PERCENT OF CLASS|Percent of class)(.*)(?=%)"
percent = re.findall(r'\d+.\d+', re.search(regex, soup.text, re.DOTALL).group().split('%')[0])

如何从归档中提取这 5 条信息？提前致谢

score 1 · Accepted Answer

尝试以下代码获取所有值。使用find() 和 css 选择器select_one()

import requests
import re
from bs4 import BeautifulSoup

page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-child(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-child(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-child(11) > b > u").text.strip()
print(Dateof)

输出：

Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019

更新

如果您不想使用位置，请尝试以下一种。

import requests
import re
from bs4 import BeautifulSoup

page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:contains(Issuer)').find_next('u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one('p:contains(CUSIP)').find_next('u').text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one('p:contains(Event)').find_next('u').text.strip()
print(Dateof)

输出：

Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019

更新 2：

import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-of-type(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-of-type(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-of-type(11) > b > u").text.strip()
print(Dateof)

score 0 · Accepted Answer

使用 lxml，它应该以这种方式工作：

import requests
import lxml.html

url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
source = requests.get(url)

doc = lxml.html.fromstring(source.text)
name = doc.xpath('//p[text()="NAME OF REPORTING PERSON"]/following-sibling::p/text()')[0]
issuer = doc.xpath('//p[contains(text(),"(Name of Issuer)")]//u/text()')[0]
cusip = doc.xpath('//p[contains(text(),"(CUSIP Number)")]//u/text()')[0]
perc = doc.xpath('//p[contains(text(),"PERCENT OF CLASS REPRESENTED")]/following-sibling::p/text()')[0]
event = doc.xpath('//p[contains(text(),"(Date of Event Which Requires")]//u/text()')[0]

输出：

Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019

python - REGEX 从 EDGAR SC-13 表格中提取信息

2 回答 2

Related

Reference