python - Accessing a link href with BeautifulSoup in Python

Question

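I'm looking for help web-scraping the SEC's EDGAR database with BeautifulSoup. I have a list of investment firm names that I'm trying to loop through in order to access each firm's 13F filings.

So far, with BeautifulSoup, I can pick out an individual entry, but I can't work out how to combine the SEC's base web URL with a specific filing to actually access the data.

My code so far looks like this: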

import requests
from bs4 import BeautifulSoup

headers = {"user-agent": 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0'}

for i in firms: # pre-determined list, but using IFP Advisors for this example as 'i'
    edgar_url = r'https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3D13F-HR+and+company-name+%3D+%22' + i + '%22&first=2020&last=2021&output=atom'

    response = requests.get(url = edgar_url, headers = headers)
    soup = BeautifulSoup(response.content, 'lxml')
    entries = soup.find_all('entry')

This gives me a list of entries, one per 13F filing.

   <entry>
      <title>13F-HR - IFP Advisors, Inc</title>
      <link rel="alternate" type="text/html" href="/Archives/edgar/data/1641866/000164186621000007/0001641866-21-000001-index.htm"/>
      <summary type="html">&lt;b&gt;Filed Date:&lt;/b&gt; 01/25/2021 &lt;b&gt;Accession Number:&lt;/b&gt; 0001641866-21-000001 &lt;b&gt;Size:&lt;/b&gt; 4 MB</summary>
      <updated>01/25/2021</updated>
      <category scheme="http://www.sec.gov/" label="form type" term="4"/>
      <id>urn:tag:sec.gov,2008:accession-number=0001641866-21-000001</id>
   </entry>

Ultimately, what I want to do is pull out the href shown above:

/Archives/edgar/data/1641866/000164186621000007/0001641866-21-000007-index

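and pair it with the scheme in the entry to access the 13F filing's text file, which can be found here: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007.txt

Although I have specified the scheme, I'm looking for a way to extract the link href from each entry and build a new URL from it to access more data.

Any help or advice would be greatly appreciated. Thank you in advance!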

score 1 · Accepted Answer

To get the URL of the complete submission text file, you can use this example:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

firms = [
    "IFP Advisors, Inc",
]

entries = []
for i in firms:
    edgar_url = (
        r"https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3D13F-HR+and+company-name+%3D+%22"
        + i
        + "%22&first=2020&last=2021&output=atom"
    )
    response = requests.get(url=edgar_url, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    entries.extend(soup.find_all("entry"))

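for e in entries:
    # The Atom <link> href is site-relative, so prepend the SEC host.
    url = "https://www.sec.gov" + e.link["href"]
    print("Getting URL:", url)
    # Fetch the filing index page, which lists the documents in the filing.
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    # Find the link in the cell right after "Complete submission text file".
    link = soup.select_one(
        'td:-soup-contains("Complete submission text file") + td a'
    )
    submission_url = "https://www.sec.gov" + link["href"]
    print("Complete submission text file:", submission_url)
    print()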

Prints:

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000005/0001641866-21-000005-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000005/0001641866-21-000005.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000004/0001641866-21-000004-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000004/0001641866-21-000004.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000006/0001641866-20-000006-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000006/0001641866-20-000006.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000002/0001641866-20-000002-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000002/0001641866-20-000002.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000001/0001641866-20-000001-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000001/0001641866-20-000001.txt
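
As a side note, in the output above every text-file URL is simply the index URL with "-index.htm" swapped for ".txt". If that pattern holds for the filings you care about (an assumption based only on this output, not something I have seen guaranteed), you could skip the second request. The sketch below uses urllib.parse.urljoin from the standard library to resolve the relative href; the helper names index_url_from_href and submission_url_from_index are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.sec.gov"
INDEX_SUFFIX = "-index.htm"


def index_url_from_href(href):
    """Resolve a site-relative href from an Atom <link> against the SEC host."""
    return urljoin(BASE_URL, href)


def submission_url_from_index(index_url):
    """Guess the complete-submission .txt URL from a filing index URL.

    Assumption: the index URL ends in "-index.htm" and the text file has the
    same name with a ".txt" extension, as in the output above.
    Returns None when the URL does not follow that pattern.
    """
    if not index_url.endswith(INDEX_SUFFIX):
        return None
    return index_url[: -len(INDEX_SUFFIX)] + ".txt"


href = "/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007-index.htm"
index_url = index_url_from_href(href)
print(index_url)
print(submission_url_from_index(index_url))
```

In the loop above you would pass e.link["href"] as the href. Scraping the index page, as in the main example, is the safer route when you cannot rely on the naming pattern, which is why the helper returns None instead of guessing.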
