python - Search for a heading tag's content by the heading's name

Question

I am scraping a page and need to get the number of employees from markup like this:

<h5>Number of Employees</h5>
<p>
            20
</p>

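I need to get the number "20". The problem is that the number is not always under the same heading tag: sometimes it is under an "h4", and the page has several other "h5" headings. So I need to find the heading whose text is "Number of Employees" and extract the number from the paragraph that follows it.

Here is the link to the page: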

http://www.bbb.org/chicago/business-reviews/paving-contractors/lester-s-material-service-inc-in-grayslake-il-72000434/

score 1 · Accepted Answer

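Well, the simplest approach is to find an element containing the text "Number of Employees" and then take the paragraph after it, assuming that paragraph always immediately follows.

Here is some quick-and-dirty code that does this and prints the number: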

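# `soup` is assumed to be a BeautifulSoup object built from the page's HTML.
parent = soup.find("div", id="business-additional-info-text")
# The label may sit in an <h4> or an <h5>, so check both.
for heading in parent.find_all(["h4", "h5"]):
    if "Number of Employees" in heading.get_text():
        # Take the first <p> after the matching heading and trim whitespace.
        print(heading.find_next("p").get_text(strip=True))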
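
The same idea (find the heading by its text, then take the paragraph that follows) can also be sketched with only the standard library. This is an alternative to the BeautifulSoup code above, not a copy of it. The class name `EmployeeCountParser`, the `SAMPLE` markup, and the "Business Started" entry are made up for illustration, and it assumes the heading text arrives as a single text node:

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class EmployeeCountParser(HTMLParser):
    """Collect the text of the first <p> that follows a heading with the given label."""

    def __init__(self, label="Number of Employees"):
        super().__init__()
        self.label = label
        self.in_heading = False  # currently inside a heading tag
        self.armed = False       # matching heading seen, waiting for the next <p>
        self.in_target = False   # currently inside the wanted <p>
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = True
        elif tag == "p" and self.armed and self.result is None:
            self.in_target = True
            self.result = ""

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False
        elif tag == "p" and self.in_target:
            self.in_target = False
            self.armed = False

    def handle_data(self, data):
        if self.in_heading and self.label in data:
            self.armed = True
        elif self.in_target:
            self.result += data


# Made-up sample: the label sits in an <h4> here, next to an unrelated <h5>.
SAMPLE = """
<h5>Business Started</h5>
<p>unrelated text</p>
<h4>Number of Employees</h4>
<p>
            20
</p>
"""

parser = EmployeeCountParser()
parser.feed(SAMPLE)
parser.close()
count = (parser.result or "").strip()  # empty string if the heading was not found
print(count)
```

Because the parser matches on the heading text rather than the tag name, it does not matter whether the label is in an "h4" or an "h5".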

score 0 · Accepted Answer

'normalize-space(//*[self::h4 or self::h5][contains(., "Number of Employees")]/following-sibling::p[1]/text())'
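
One way to evaluate this expression is with lxml, sketched below. This assumes the third-party `lxml` package is installed. The `SAMPLE` markup is made up, modelled on the snippet in the question, and on the real page you would parse the downloaded HTML instead:

```python
from lxml import html

# Made-up sample: the label sits in an <h4>, next to an unrelated <h5>.
SAMPLE = """
<div>
  <h5>Business Started</h5>
  <p>unrelated text</p>
  <h4>Number of Employees</h4>
  <p>
            20
  </p>
</div>
"""

XPATH = (
    'normalize-space(//*[self::h4 or self::h5]'
    '[contains(., "Number of Employees")]'
    '/following-sibling::p[1]/text())'
)

tree = html.fromstring(SAMPLE)
# normalize-space() is a string function, so xpath() returns a string here,
# not a list of nodes; surrounding whitespace is already trimmed.
employees = tree.xpath(XPATH)
print(employees)
```

If no matching heading exists, `normalize-space()` yields an empty string, so the result can be checked with a plain truthiness test.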
