We are trying to parse SEC EDGAR filings with Python. I am trying to get the table "Sales by Segment of Business" on line 21. Here is the link to the document.

https://www.sec.gov/ix?doc=/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm

Below is the code we found online. All of the data on the page sits under this tag.

<div id="dynamic-xbrl-form" class="position-relative">

We are unable to print this data.

from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '200406'
type = '10-K'
dateb = '20210704'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
print(soup)

Can anyone help us get this table? Any suggestions would be appreciated.

2 Answers

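First

The str.format call in your code is valid as written. An f-string is an equivalent and arguably more readable way to build the same URL.

The f-string version: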

# Obtain HTML for search page
base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type={type}&dateb={dateb}"
edgar_resp = requests.get(base_url)
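
As a quick check, here is a minimal sketch comparing the two ways of building the URL. The variable name form_type is my own choice (it avoids shadowing the built-in type()); everything else comes from the question.

```python
cik = "200406"
form_type = "10-K"  # hypothetical name, used instead of `type` to avoid shadowing the built-in
dateb = "20210704"

# Style used in the question: positional placeholders filled by str.format
template = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
url_from_format = template.format(cik, form_type, dateb)

# Style used in this answer: an f-string with the names inlined
url_from_fstring = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type={form_type}&dateb={dateb}"

print(url_from_format == url_from_fstring)  # True
```

Both spellings produce the same string, so the choice is mostly about readability.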

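Second

The response comes back with status 403, which means access is forbidden. To resolve this, send a User-Agent header with the request.

The code should look like this: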

from bs4 import BeautifulSoup
import requests
from fake_useragent import UserAgent

ua = UserAgent()
# requests expects a dict of headers, so wrap the random User-Agent string
headers = {"User-Agent": ua.random}

# Access page
cik = '200406'
type = '10-K'
dateb = '20210704'

# Obtain HTML for search page
base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type={type}&dateb={dateb}"
# Pass the headers with the request; they have no effect unless they are sent
edgar_resp = requests.get(base_url, headers=headers)

print(edgar_resp)  # shows the status, e.g. <Response [200]>
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
# Now you can use BeautifulSoup to find your data
# print(soup)

You need to install the fake-useragent library:

pip install fake-useragent

See the fake-useragent documentation for more details.

After that, you can use BeautifulSoup to extract the data you need.
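
To confirm that a User-Agent header is really attached before anything is sent, here is a small sketch using only the standard library (urllib and http) rather than requests. The USER_AGENT value and both helper names are hypothetical. As far as I know, SEC asks automated clients to identify themselves with a descriptive User-Agent, so a fixed string with contact details may be a safer choice than a random one.

```python
from http import HTTPStatus
from urllib.request import Request

# Hypothetical identification string; replace with your own name and contact address.
USER_AGENT = "Example Research admin@example.com"


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Create (but do not send) a GET request that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def explain_status(code: int) -> str:
    """Return a readable label for an HTTP status code."""
    return f"{code} {HTTPStatus(code).phrase}"


req = build_request(
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=200406&type=10-K&dateb=20210704"
)
# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
print(explain_status(403))  # 403 Forbidden
```

Nothing here touches the network; it only shows that the header is set on the request object and what a 403 stands for.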

Answered 2021-09-02T07:25:35.240

The URL you mention is a dynamic page. However, its content is loaded from this static page:

https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm

You can scrape this page and extract the data.
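
For the two URLs in this thread, the static address is simply the value of the doc query parameter on the viewer URL. A minimal sketch of that conversion follows. The function name static_url_from_viewer is my own, and I am assuming other ix?doc= links follow the same pattern.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def static_url_from_viewer(viewer_url: str) -> str:
    """Turn an .../ix?doc=/Archives/... viewer URL into the static document URL.

    Returns the input unchanged if there is no `doc` query parameter.
    """
    parts = urlsplit(viewer_url)
    doc_values = parse_qs(parts.query).get("doc")
    if not doc_values:
        return viewer_url
    # Keep scheme and host, swap the path for the document path, drop the query
    return urlunsplit((parts.scheme, parts.netloc, doc_values[0], "", ""))


viewer = "https://www.sec.gov/ix?doc=/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
print(static_url_from_viewer(viewer))
```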

Here is code that scrapes the data you need.

from bs4 import BeautifulSoup
import requests


headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text


soup = BeautifulSoup(edgar_str, 'html.parser')
s =  soup.find('span', recursive=True, string='SALES BY SEGMENT OF BUSINESS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))

Output:

['Fiscal Second Quarter Ended', 'Fiscal Six Months Ended']
['(Dollars in Millions)', 'July 4,', '2021', 'June 28,', '2020', 'Percent', 'Change', 'July 4,', '2021', 'June 28,', '2020', 'Percent Change']
['Consumer Health']
['OTC']
['U.S.', '$', '675', '627', '7.7', '%', '$', '1,274', '1,316', '(', '3.2', ')', '%']
['International', '633', '522', '21.2', '1,208', '1,181', '2.3']
['Worldwide', '1,307', '1,149', '13.8', '2,482', '2,497', '(', '0.6', ')']
['Skin Health/Beauty']
['U.S.', '659', '536', '23.0', '1,293', '1,195', '8.2']
['International', '511', '471', '8.4', '1,040', '929', '12.0']
['Worldwide', '1,170', '1,007', '16.2', '2,333', '2,124', '9.8']
['Oral Care']
['U.S.', '165', '170', '(', '3.1', ')', '328', '346', '(', '5.2', ')']
['International', '260', '227', '14.6', '514', '446', '15.3']
['Worldwide', '426', '397', '7.0', '843', '792', '6.3']
['Baby Care']
['U.S.', '97', '96', '0.8', '193', '188', '2.4']
['International', '290', '260', '11.5', '583', '529', '10.2']
['Worldwide', '387', '356', '8.6', '776', '717', '8.1']
["Women's Health"]
['U.S.', '3', '3', '(', '3.1', ')', '6', '7', '(', '16.0', ')']
['International', '227', '199', '14.2', '446', '427', '4.5']
['Worldwide', '230', '202', '13.9', '452', '434', '4.2']
['Wound Care/Other']
['U.S.', '153', '126', '20.9', '268', '245', '9.3']
['International', '64', '59', '7.3', '125', '111', '12.1']
['Worldwide', '216', '185', '16.6', '393', '356', '10.2']
['TOTAL', 'Consumer Health']
['U.S.', '1,751', '1,557', '12.4', '3,362', '3,297', '2.0']
['International', '1,984', '1,739', '14.1', '3,916', '3,624', '8.1']
['Worldwide', '3,735', '3,296', '13.3', '7,278', '6,921', '5.2']
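
The rows above still contain presentation tokens: '$', '%', thousands separators, and parentheses around negative values. If you want numbers, one possible clean-up is sketched below. The helper name clean_row is my own, and it is meant for data rows only; the two header rows contain years that would also parse as numbers, so skip those.

```python
def clean_row(tokens):
    """Split one row of stripped strings into (label, list of floats).

    '$' and '%' are dropped, commas are removed, and a value wrapped in
    '(' ... ')' is treated as negative (accounting notation).
    """
    label_parts, values = [], []
    negative = False
    for tok in tokens:
        if tok in ("$", "%", ")"):
            continue
        if tok == "(":
            negative = True  # the next number is negative
            continue
        try:
            number = float(tok.replace(",", ""))
        except ValueError:
            label_parts.append(tok)  # not a number, so part of the row label
            continue
        values.append(-number if negative else number)
        negative = False
    return " ".join(label_parts), values


row = ['U.S.', '$', '675', '627', '7.7', '%', '$', '1,274', '1,316', '(', '3.2', ')', '%']
print(clean_row(row))
# ('U.S.', [675.0, 627.0, 7.7, 1274.0, 1316.0, -3.2])
```

Rows such as ['Consumer Health'] come back with an empty list of values, which makes it easy to tell section headings from data rows.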
Answered 2021-09-02T07:24:49.707