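I'm trying to save the content of an HTML page to an .html file, but I only want to keep what is inside the "table" tags. I also want to remove all empty tags such as <b></b>. I've already done all of this with BeautifulSoup: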

import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()
soup = BeautifulSoup(html)

# collect the HTML of every <table class="main">
txt = ""
for table in soup.find_all("table", {'class': 'main'}):
    txt += str(table)

# re-parse the collected tables and strip empty <b> tags
soup = BeautifulSoup(txt)
empty_tags = soup.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
for empty_tag in empty_tags:
    empty_tag.extract()

My question is: can this also be done with lxml? If so, roughly what would that look like? Thanks a lot for your help.

1 Answer

import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

# extract HTML content from all tables
# (pass method="text" as well to get the text content without tags)
html_content = "\n".join([lxml.html.tostring(t, encoding="unicode") for t in tables])

# removing only specific empty tags, here <b></b> and <i></i>
# drop_tree() keeps any text that follows the tag; getparent().remove() would drop it too
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.drop_tree()

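# removing all empty tags (tags that have no child nodes at all)
# caution: this also matches void elements such as <br> and <img>, and empty <td> cells
for empty in root.xpath('//*[not(node())]'):
    empty.drop_tree()
# root does not contain those empty tags anymore; serialize the tables after this step to save the cleaned HTML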
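
Putting the pieces together, here is a minimal Python 3 sketch of the whole flow: select the tables, drop the empty <b> tags, then serialize. It parses an inline string instead of downloading a page, so SAMPLE_HTML and the function name extract_main_tables are made up for illustration and are not part of the original question. It also treats a <b> that contains only whitespace as empty, which matches the BeautifulSoup version in the question.

```python
import lxml.html

# Hypothetical sample page standing in for the real download
SAMPLE_HTML = """
<html><body>
<p>outside <b></b> the table</p>
<table class="main">
  <tr><td>Hello <b></b>world</td><td><b>bold</b><b>  </b></td></tr>
</table>
</body></html>
"""


def extract_main_tables(html):
    """Return the HTML of every <table class="main"> with empty <b> tags removed."""
    root = lxml.html.fromstring(html)
    # exact match on the class attribute; use cssselect('table.main') for multi-class tables
    tables = root.xpath('//table[@class="main"]')
    for table in tables:
        # <b> elements with no child elements and no non-whitespace text
        for empty in table.xpath('.//b[not(*) and not(normalize-space())]'):
            # drop_tree() joins the tail text ("world" above) onto the preceding text
            empty.drop_tree()
    # serialize only after the cleanup so the output reflects the removals
    return "\n".join(lxml.html.tostring(t, encoding="unicode") for t in tables)


result = extract_main_tables(SAMPLE_HTML)
print(result)
```

To save the result, writing the returned string to a file opened with encoding="utf-8" should be enough. For a real page, lxml.html.parse(url).getroot() from the answer above can replace the fromstring() call.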
answered 2013-08-25T22:52:54.223