How about using 'contents', then replacing the <br>?
Here is a complete (working, tested) example:
from bs4 import BeautifulSoup
import urllib2
url="http://www.floris.us/SO/bstest.html"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
result = soup.find(attrs={'class':'myclass'})
print "The result of soup.find:"
print result
print "\nresult.contents:"
print result.contents
print "\nresult.get_text():"
print result.get_text()
for r in result:
    if r.string is None:
        r.string = ' '
print "\nAfter replacing all the 'None' with ' ':"
print result.get_text()
Result:
The result of soup.find:
<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>
result.contents:
[u'Lorem ipsum', <br/>, u'dolor sit amet,', <br/>, u'consectetur...']
result.get_text():
Lorem ipsumdolor sit amet,consectetur...
After replacing all the 'None' with ' ':
Lorem ipsum dolor sit amet, consectetur...
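This is more complicated than Sean's very compact solution, but since I said I would create and test a solution along the lines I had indicated when I could, I decided to make good on my promise. It also shows a little more clearly what is going on: each <br/> is its own element in the result.contents list, but it turns into "nothing" when converted to a string, so get_text() runs the neighboring pieces of text together.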
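For anyone on Python 3, the same idea can be sketched without the network fetch. This is only a minimal sketch under a few assumptions: the HTML string is an inline copy of the span shown in the result above, the helper name text_with_spaces is made up for illustration, and the "html.parser" backend is my choice rather than something from the original example. Instead of assigning to r.string, it swaps each <br/> for a one-space text node with replace_with.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup printed in the result above (assumption: this
# stands in for the page fetched from the URL in the original example).
HTML = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'


def text_with_spaces(html):
    """Return the text of the 'myclass' element with each <br/> turned into a space.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find(attrs={"class": "myclass"})
    # find_all returns a separate list of matches, so replacing tags
    # while looping over it is safe.
    for br in span.find_all("br"):
        br.replace_with(" ")
    return span.get_text()


print(text_with_spaces(HTML))
```

If you do not need to modify the tree at all, passing a separator to get_text, as in span.get_text(" "), should give a similar result for this markup.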