python - Extracting the content of an HTML file

Question

I have an HTML file that looks like this (simplified):

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>

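What I want to extract is the content of table class="main"; in other words, I want to write exactly the content shown above to a file. Note that this example is simplified: around the table tags there are many other tags. I tried to extract the content with the following code: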

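import lxml.html

# Parse the document and get its root element
root = lxml.html.parse('www.test.xyz').getroot()

# Remove empty <b> and <i> elements
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# Select every table with class "main"
tables = root.cssselect('table.main')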

The code above works. The problem is that I get one part twice. To show what I mean, this is the result of the code:

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>

So the problem is that the middle part appears once more at the end. Why does this happen, and how can I avoid it?

paul t., another Stack Overflow user, told me to use root.xpath('//table[@class="main" and not(.//table[@class="main"])]'). That expression prints exactly the part that I get twice.

I hope the problem is described clearly enough. Thanks for any help and suggestions :)

score 1 · Accepted Answer

You want to select every table of class "main" that is not itself a descendant of another table of class "main".
This seems to work fine:

root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
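The duplication happens because the selector returns both the outer and the nested table as separate elements, and serializing the outer one already includes the nested one. The following is a minimal sketch comparing the three expressions from this thread. The sample markup and the id attributes are hypothetical, added only to make the results easy to tell apart; the table contents are wrapped in tr/td here so the nesting parses predictably.

```python
import lxml.html

# Hypothetical sample document: one "main" table nested inside another.
SAMPLE = """
<html><body>
<table class="main" id="outer"><tr><td>
Here is some text.
<table class="main" id="inner"><tr><td>Here is another text.</td></tr></table>
Here are also some words...
</td></tr></table>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)

# Matches every table with class "main", including the nested one.
all_tables = root.xpath('//table[@class="main"]')

# Matches only tables that have no "main" table among their ancestors.
outermost = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

# Matches only tables that contain no "main" table (the suggestion from the question).
innermost = root.xpath(
    '//table[@class="main" and not(.//table[@class="main"])]'
)

# Serializing the outermost table already includes the nested table once.
html_out = lxml.html.tostring(outermost[0], encoding="unicode")
```

Note that @class="main" compares the whole attribute value, so a table with class="main wide" would not match any of these expressions.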
