python - Why doesn't this Beautiful Soup code parse my target text?

Question

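I'm trying to select the header of the "Properties" section in this 10-K filing; once it is selected, I intend to scrape the text in that section (i.e. all the text between the "Properties" and "Legal Proceedings" section headers).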

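When I run the code below, I get IndexError 'list index out of range', but I don't understand why, since the text "PROPERTIES" appears to be inside a "p" tag. I also tried using 'id="ITEM_2_PROPERTIES"' instead of text=, but that didn't work either.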

Where am I going wrong?

import requests
from bs4 import BeautifulSoup


url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/1318605/000156459020004475/tsla-10k_20191231.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

properties_header = soup.find_all('p', text="PROPERTIES")[0]

print(properties_header)

score 1 · Accepted Answer

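That's because you're making a request to a JS-rendered page, so the HTML that requests gets back contains no p tag with the text PROPERTIES. find_all therefore returns an empty list, and indexing it with [0] raises the IndexError.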

However, if you change the target URL, you get this:

import requests
from bs4 import BeautifulSoup


url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459020004475/tsla-10k_20191231.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

properties_header = soup.find_all('p', text="PROPERTIES")

print(properties_header)

Output:

[<p id="ITEM_2_PROPERTIES" style="margin-bottom:0pt;margin-top:0pt;font-weight:bold;font-style:normal;text-transform:none;font-variant: normal;font-family:Times New Roman;font-size:10pt;">PROPERTIES</p>]
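Once the header is found, the text between it and the next section header can be collected by walking the header's following siblings. The snippet below is only a rough sketch on a made-up HTML fragment: the id "ITEM_2_PROPERTIES" comes from the output above, while the id "ITEM_3_LEGAL_PROCEEDINGS", the helper name section_text, and the flat sibling layout are all assumptions. The real filing may nest things differently, so check its markup first.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the filing's markup. Only the id "ITEM_2_PROPERTIES"
# is taken from the real output; everything else here is hypothetical.
SAMPLE_HTML = """
<div>
<p id="ITEM_1B">UNRESOLVED STAFF COMMENTS</p>
<p>None.</p>
<p id="ITEM_2_PROPERTIES">PROPERTIES</p>
<p>We own or lease several facilities.</p>
<p>Our headquarters are leased.</p>
<p id="ITEM_3_LEGAL_PROCEEDINGS">LEGAL PROCEEDINGS</p>
<p>See the notes.</p>
</div>
"""


def section_text(soup, start_id, end_id):
    """Return the text of each sibling tag between two header ids."""
    start = soup.find(id=start_id)
    if start is None:
        # Header not present (e.g. the JS-rendered viewer page was fetched).
        return []
    chunks = []
    for sibling in start.find_next_siblings():
        if sibling.get("id") == end_id:
            break  # reached the next section header
        text = sibling.get_text(" ", strip=True)
        if text:
            chunks.append(text)
    return chunks


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
paragraphs = section_text(soup, "ITEM_2_PROPERTIES", "ITEM_3_LEGAL_PROCEEDINGS")
print(paragraphs)
```

Returning an empty list when the start header is missing avoids the IndexError from the question and lets the caller decide how to handle a page that does not contain the section.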

I got the new target URL from the Developer Tools. It shows up when you turn JS back on. So I think you should target that URL for your future requests.
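Comparing the two URLs in this thread, the working one is simply the value of the doc query parameter appended to the host. A small helper along these lines could derive it without opening the Developer Tools each time. This is a sketch that assumes every viewer link follows the same ix?doc=... pattern as the one in the question; the function name raw_filing_url is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def raw_filing_url(viewer_url):
    """Turn an ix?doc=... viewer URL into the direct document URL.

    Assumes the document path is carried in a `doc` query parameter,
    as in the URL from the question. URLs without it are returned as-is.
    """
    parts = urlsplit(viewer_url)
    doc = parse_qs(parts.query).get("doc")
    if not doc:
        return viewer_url
    return f"{parts.scheme}://{parts.netloc}{doc[0]}"


viewer_url = (
    "https://www.sec.gov/ix?doc=/Archives/edgar/data/1318605/"
    "000156459020004475/tsla-10k_20191231.htm"
)
raw_url = raw_filing_url(viewer_url)
print(raw_url)
```

The resulting string can then be passed to requests.get in place of the viewer URL.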
