Select the div element that sits just before the text and use next_sibling:
from bs4 import BeautifulSoup
html = '<div id="product-short-summary-wrap">\
<b class="tip-anchor tip-anchor-wrap">Short summary description Toshiba Satellite Pro C850-1GR</b>ev\
:\
<br/>\
<div class="tooltip-text">This short summary of the data-sheet.</div>\
Toshiba Satellite Pro C850-1GR Satellite Pro, 1.8 GHz\
</div>'
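soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
text = soup.find("div", {"class": "tooltip-text"})
# next_sibling is the text node that follows the div
print(text.next_sibling.string)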
Output:
Toshiba Satellite Pro C850-1GR Satellite Pro, 1.8 GHz
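On a real page the node right after the div is often just whitespace or another tag such as <br/>, in which case the direct sibling lookup prints nothing useful. A minimal sketch of a more tolerant lookup, assuming Python 3 and bs4; the helper name text_after and the shortened HTML are made up for illustration:

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical markup: whitespace and a <br/> sit between the div and the text
html = (
    '<div id="wrap">'
    '<div class="tooltip-text">This short summary of the data-sheet.</div>\n'
    '   <br/>\n'
    'Toshiba Satellite Pro C850-1GR Satellite Pro, 1.8 GHz\n'
    '</div>'
)

def text_after(tag):
    """Return the first non-blank text node among tag's following siblings, or None."""
    for sibling in tag.next_siblings:
        # Skip tags (e.g. <br/>) and whitespace-only strings
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "tooltip-text"})
result = text_after(div)
print(result)
```

The helper returns None instead of raising when no text follows the tag, so the caller can decide what to do with a page that lacks the summary.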
If you only want the text when the div contains "This short summary of the data-sheet.", you can try this:
from bs4 import BeautifulSoup
html = '<div id="product-short-summary-wrap">\
<b class="tip-anchor tip-anchor-wrap">Short summary description Toshiba Satellite Pro C850-1GR</b>ev\
:\
<br/>\
<div class="tooltip-text">This short summary of the data-sheet.</div>\
Toshiba Satellite Pro C850-1GR Satellite Pro, 1.8 GHz\
</div>'
soup = BeautifulSoup(html, "html.parser")
text = soup.find("div", {"class": "tooltip-text"})
# Only print the following text when the div holds the expected sentence
if "This short summary of the data-sheet." in text.string:
    print(text.next_sibling.string)
Output:
Toshiba Satellite Pro C850-1GR Satellite Pro, 1.8 GHz
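I think you posted the wrong HTML on PasteBin, but I found the site you are trying to scrape. I'm not sure exactly which page you mean, so here is what I found and did. If you visit this page, you can find the same HTML section as in your question. My code to extract the text: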
from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t01gfr/satellite-pro-notebooks-4051528036589-C8501GR-17411822.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
texts = soup.find_all("div", {"class": "tooltip-text"})
for text in texts:
    # text.string is None if the div is empty or has several children
    if text.string and "This short summary of the" in text.string:
        print(text.next_sibling.string.strip())
Output:
Toshiba C850-1GR Satellite Pro, 1.8 GHz, Intel Celeron, 1000M, 4 GB, DDR3-SDRAM, 1600 MHz
It works the same way for a different URL. Output:
Intel H2312WPFJR, Socket R (2011), Intel, Xeon, 2048 GB, DDR3-SDRAM, 2048 GB
If you need to, you can split the string after you have found it.
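As a rough sketch of that splitting step, assuming the fields are always separated by commas as in the output above (the variable names are only illustrative):

```python
# Summary string as printed by the scraping code above
summary = ("Toshiba C850-1GR Satellite Pro, 1.8 GHz, Intel Celeron, "
           "1000M, 4 GB, DDR3-SDRAM, 1600 MHz")

# Split on commas and trim the whitespace around each field
fields = [part.strip() for part in summary.split(",")]

for field in fields:
    print(field)
```

This would break a value that itself contains a comma (for example a number written as "1,000"), so check a few real pages before relying on it.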