python - Parsing HTML with lxml (python)

Question

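I'm trying to save the content of an HTML page to a .html file, but I only want to keep the content inside the "table" tags. I also want to remove all empty tags such as <b></b>. I have already done all of this with BeautifulSoup: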

import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()
soup = BeautifulSoup(html)

txt = ""

# collect the HTML of every <table class="main">
for table in soup.find_all("table", {'class': 'main'}):
    txt += str(table)

# re-parse the collected tables and remove empty <b> tags
text = BeautifulSoup(txt)
empty_tags = text.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
for empty_tag in empty_tags:
    empty_tag.extract()

My question is: can this also be done with lxml? If so, roughly what would it look like? Thanks a lot for your help.

score 3 · Accepted Answer

import lxml.html

# lxml can download pages directly
root = lxml.html.parse('http://test.xyz').getroot()

# use a CSS selector for class="main",
# or use root.xpath('//table[@class="main"]')
tables = root.cssselect('table.main')

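# extract HTML content from all tables
# use lxml.html.tostring(t, method="text", encoding="unicode")
# to get text content without tags
# encoding="unicode" returns text instead of bytes, so the join
# works on both Python 2 and Python 3
tables_html = "\n".join(lxml.html.tostring(t, encoding="unicode") for t in tables)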

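# removing only specific empty tags, here <b></b> and <i></i>
# (do this before serializing the tables if the output should not contain them)
# note: remove() also discards the element's tail text, i.e. the text that
# follows its closing tag; use empty.drop_tree() instead to keep that text
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# removing all empty tags (tags that do not have children nodes)
# note: this also matches void elements such as <br> and <img>
for empty in root.xpath('//*[not(node())]'):
    empty.getparent().remove(empty)
# root does not contain those empty tags anymore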
