I'm trying to convert an HTML block to text using Python.
Input:
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
Desired output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
I tried the html2text module, but without much success:
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
txt = soup.find('div', {'class' : 'body'})
print(html2text.html2text(txt))
The txt object produces the HTML block above. I'd like to convert it to text and print it on the screen.
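For reference, here is a minimal sketch of the kind of conversion I'm after. It uses only the standard library's html.parser rather than html2text or BeautifulSoup, and it is written for Python 3, unlike the snippet above. The names ParagraphTextExtractor, html_block_to_text, and sample are hypothetical and exist only for this illustration. It assumes the text of each <p> element should become one output line and that empty paragraphs should be dropped.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect the text of each <p> element as one line (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._buffer = None   # text chunks of the currently open <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace and skip empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def html_block_to_text(html):
    """Return the paragraph texts of an HTML string, one per line."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Shortened stand-in for the HTML block shown in the question.
sample = (
    '<div class="body"><p><strong></strong></p>\n'
    '<p><strong></strong>Lorem ipsum dolor sit amet.</p>\n'
    '<p>Consectetuer adipiscing elit. '
    '<a href="http://example.com/" target="_blank" class="source">Some Link</a>'
    ' Aenean commodo ligula eget dolor.</p></div>'
)
print(html_block_to_text(sample))
```

With the BeautifulSoup-based snippet above, the HTML string would presumably come from str(txt) rather than a literal. That is an assumption on my part and I have not verified it against html2text.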