html - Extracting text from a span tag with BeautifulSoup

Question

I am trying to extract the estimated monthly cost of "$1,773" from this URL:

https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/

Inspecting that part of the page, I see the following markup:

<div class="sc-qWfCM cdZDcW">
   <span class="Text-c11n-8-48-0__sc-aiai24-0 dQezUG">Estimated monthly cost</span>
   <span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,773</span></div>

To extract the $1,773, I tried the following:

from bs4 import BeautifulSoup
import requests

url = 'https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/'
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html")

print(soup.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'}))

This returns a list of three elements, none of which contains $1,773:

[<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$463,300</span>, 
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,438</span>, 
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$2,300<!-- -->/mo</span>]

Can someone explain how to get $1,773 back?

score 1 · Accepted Answer

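When scraping a web page, you need to distinguish its components by how they are rendered. Some components are rendered statically on the server; others are rendered dynamically in the browser. requests only downloads the initial HTML and does not run JavaScript, so a value that is filled in or updated dynamically may be missing from, or different in, what BeautifulSoup sees. Dynamic content also takes some time to load, because the page has to call a backend API first.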
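
One quick way to tell which case applies is to check whether the value shows up in the HTML before any JavaScript has run. The following is only a minimal sketch: the helper name `appears_in_static_html` is hypothetical, and the sample markup is an assumed stand-in for a server response, not Zillow's actual output.

```python
from bs4 import BeautifulSoup


def appears_in_static_html(html: str, needle: str) -> bool:
    """Return True if `needle` occurs in the text of the unrendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return needle in soup.get_text()


# Assumed sample of what a plain HTTP fetch might return (illustrative only).
static_html = (
    "<div><span>Estimated monthly cost</span>"
    "<span>$1,438</span></div>"
)

print(appears_in_static_html(static_html, "$1,438"))
print(appears_in_static_html(static_html, "$1,773"))
```

If the value you see in the browser is absent from the fetched HTML, it is most likely rendered dynamically, and a browser-driving tool is probably needed.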


I tried loading your page with Selenium and ChromeDriver:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/")
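# Give the JavaScript-rendered content time to load
time.sleep(6)
# find_elements_by_xpath was removed in Selenium 4; use find_elements with By.XPATH
el = driver.find_elements(By.XPATH, "//span[@class='Text-c11n-8-48-0__sc-aiai24-0 jLucLe']")

for e in el:
    print(e.text)

driver.quit()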

#OUTPUT
$463,300
$1,773
$2,300/mo

score 1 · Accepted Answer

I think you have to find the parent element first. For example:

parent_div = soup.find('div', {'class': 'sc-fzqBZW bzsmsC'})
result = parent_div.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'})
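
Because generated class names such as `jLucLe` are shared by several spans and may change between deployments, another option is to anchor on the label text and take the span that follows it. This is a sketch under assumptions: the helper name `value_after_label` is hypothetical, and the HTML is the snippet from the question, assumed to be already rendered (for example, taken from a real browser session).

```python
from bs4 import BeautifulSoup

# Markup copied from the question; assumed to be the rendered HTML.
html = """
<div class="sc-qWfCM cdZDcW">
   <span class="Text-c11n-8-48-0__sc-aiai24-0 dQezUG">Estimated monthly cost</span>
   <span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,773</span></div>
"""


def value_after_label(html, label):
    """Return the text of the span that follows the span containing `label`."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the span whose only text is exactly the label.
    label_span = soup.find("span", string=label)
    if label_span is None:
        return None
    # The value sits in the next sibling span of the same parent div.
    value_span = label_span.find_next_sibling("span")
    return value_span.get_text(strip=True) if value_span else None


print(value_after_label(html, "Estimated monthly cost"))
```

The helper returns None when the label is missing, so the caller can tell "not rendered yet" apart from a parsing error.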
