python - Cleaning up and removing tags with BeautifulSoup

Question

So far I have the following script:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2

br = Browser()
br.open("http://www.foo.com")

html = br.response().read()

soup = BeautifulSoup(html)
items = soup.findAll(id="info")

It runs perfectly and produces the following "items":

<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>

However, I would like to take items and clean it up to get:

John Doe
123 Main Street
5551234

How can I remove these tags using BeautifulSoup and Python?

As always, thanks!

score 1 · Accepted Answer

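This will do it for this EXACT html. Obviously it does not tolerate any deviation, so you will need to add quite a bit of bounds checking and null checking, but here is the nitty-gritty for turning your data into plain text.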

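items = soup.findAll(id="info")
# Name: the text inside <span class="customer"><b>...</b></span>
print items[0].span.b.contents[0]
# Address: the text node that follows the first <br>
print items[0].contents[3].strip()
# Phone: the text node that follows the second <br>, with the "Phone:" label dropped
print items[0].contents[5].strip().split(":", 1)[1]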
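
For a version that tolerates small deviations in the markup, something like the following might work. This is only a sketch under assumptions that are not in the original post: it uses Python 3 and the newer bs4 package (beautifulsoup4) rather than BeautifulSoup 3, the helper name extract_info is hypothetical, and it assumes the "paid" flag should be dropped and that the phone line always starts with "Phone:".

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 (bs4), not BeautifulSoup 3

# Sample markup copied from the question.
HTML = """<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>"""


def extract_info(html):
    """Hypothetical helper: return the text lines of the #info div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find(id="info")
    if div is None:
        # Null check: the page has no element with id="info".
        return None

    # Remove the "paid" flag so its text does not end up in the output.
    paid = div.find("span", class_="paid")
    if paid is not None:
        paid.decompose()

    cleaned = []
    # stripped_strings yields each non-empty text node with whitespace trimmed.
    for line in div.stripped_strings:
        if line.startswith("Phone:"):
            # Keep only what follows the "Phone:" label.
            line = line.split(":", 1)[1].strip()
        cleaned.append(line)
    return cleaned


info = extract_info(HTML)
if info is not None:
    print("\n".join(info))
```

Because this walks the text nodes instead of indexing into contents, it should not break if extra whitespace or another line is added inside the div. It still depends on the id="info" and class="paid" attributes being present as shown in the question.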
