I am trying to use Beautiful Soup to extract the text of the "Properties" section of a 10-K SEC filing on EDGAR.

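I can find the Properties section header and walk up through its parent nodes, but from there next_sibling does not return the sibling I expect (which I believe is the first paragraph of text in that section). Can someone tell me why this doesn't work and how to fix it?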

Code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459020004475/tsla-10k_20191231.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

properties_header = soup.find_all('p', text="PROPERTIES")[0]

print(properties_header.parent.parent.parent.parent.next_sibling)

Expected result:

<p style="margin-top:4pt;margin-bottom:0pt;text-indent:5.24%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">We are headquartered in Palo Alto, California. Our principal facilities include a large number of properties in North America, Europe and Asia utilized for manufacturing and assembly, warehousing, engineering, retail and service locations, Supercharger sites, and administrative and sales offices. Our facilities are used to support both of our reporting segments, and are suitable and adequate for the conduct of our business. We primarily lease such facilities with the exception of some manufacturing facilities. The following table sets forth the location of our primary owned and leased manufacturing facilities.</p>

1 Answer

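The first next_sibling is a NavigableString (a text node, typically the whitespace between the two tags), not a tag. Call next_sibling a second time to reach the p that follows it.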

print(properties_header.parent.parent.parent.parent.next_sibling.next_sibling)
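The behaviour can be reproduced without the network on a small, made-up fragment. This is only a sketch: the HTML below is a hypothetical stand-in for the filing's layout, not its real markup, and the ids are invented for illustration. It also shows find_next_sibling, which skips text nodes and may be less fragile than chaining next_sibling twice.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical miniature of the layout: a header wrapper, a newline,
# then the paragraph we actually want (ids are illustrative only).
html = """<body><div id="header"><p>PROPERTIES</p></div>
<p id="first">We are headquartered in Palo Alto, California.</p></body>"""

soup = BeautifulSoup(html, "html.parser")
header_div = soup.find("p", string="PROPERTIES").parent

# The immediate sibling is the newline between the tags, a NavigableString.
first = header_div.next_sibling
is_text_node = isinstance(first, NavigableString)

# find_next_sibling only considers tags, so it lands on the paragraph directly.
next_tag = header_div.find_next_sibling("p")
print(repr(first), next_tag.get_text())
```

On the real filing the number of .parent hops and the tag name to search for depend on the actual markup, so those would need to be checked against the page.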
Answered 2020-10-21T18:01:32.773