html - Finding specific tags in an HTML file

Question

I have some HTML files and want to extract the content between certain tags, for example: title of the page some tagged content here.

<p>A paragraph comes here</p>
<p>A paragraph comes here</p><span class="more-about">Some text here</span><p class="en-cpy">Copyright &copy; 2012 </p>

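I only want these tags: head, p. But as you can see in the second paragraph, the last tag starts with p yet is not one of the tags I want, and I don't want its content. I use the following script to extract the text I want, but I can't filter out tags like the last one in my example. How can I extract only the <p> tags?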

grep "<p>" $File | sed -e 's/^[ \t]*//'

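I should add that the last tag (the one I don't want in the output) comes right after one of the tags I do want (as in my example), and grep returns the entire line as output. That is my problem.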

score 3 · Accepted Answer

Don't. Trying to parse HTML with regex will be painful. Use something like Ruby and Nokogiri, or a comparable language and library that you are familiar with.
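As one illustration of the "language + library" route, here is a minimal sketch using Python's standard-library `html.parser`. The class name `PlainTagExtractor` and the function `extract_plain_tags` are hypothetical helpers, not part of any library, and the rule "a wanted tag must carry no attributes" is an assumption drawn from the example in the question, where `<p>` is wanted and `<p class="en-cpy">` is not.

```python
from html.parser import HTMLParser


class PlainTagExtractor(HTMLParser):
    """Collect the text inside wanted tags that have no attributes (hypothetical helper)."""

    def __init__(self, wanted=("head", "p")):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # name of the tag currently being captured, if any
        self.buffer = []      # text pieces seen inside that tag
        self.results = []     # finished text chunks, in document order

    def handle_starttag(self, tag, attrs):
        # Assumption: only bare tags count, so <p> is captured but <p class="..."> is not.
        if self.current is None and tag in self.wanted and not attrs:
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.current:
            self.results.append("".join(self.buffer).strip())
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)


def extract_plain_tags(html, wanted=("head", "p")):
    parser = PlainTagExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    sample = ('<p>A paragraph comes here</p>'
              '<span class="more-about">Some text here</span>'
              '<p class="en-cpy">Copyright &copy; 2012 </p>')
    for text in extract_plain_tags(sample):
        print(text)
```

Because the parser works on tags rather than lines, it does not matter that the unwanted `<p class="en-cpy">` sits on the same line as a wanted `<p>`. This sketch does not handle unclosed `<p>` tags; badly broken markup may need a more forgiving parser.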

score 0 · Accepted Answer

xmllint --html --xpath "//*[name()='head' or name()='p']" "$file"

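If you are dealing with broken HTML, you may need a different parser. Here is essentially the same thing as a "one-liner" using lxml. Just pass your URL to the script as its first argument.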

#!/usr/bin/env python3
from lxml import etree
import sys

# sys.argv[1] is the file path or URL given on the command line (sys.argv[0] is the script itself)
print('\n'.join(etree.tostring(x, encoding="utf-8", with_tail=False).decode("utf-8") for x in (lambda i: etree.parse(i, etree.HTMLParser(remove_blank_text=True, remove_comments=True)).xpath("//*[name()='p' or name()='head']"))(sys.argv[1])))

score 0 · Accepted Answer

To extract the text between <p> and </p>, try this:

perl -ne 'BEGIN{$/="</p>";$\="\n"}s/.*(<p>)/$1/&&print' < input-file > output-file

Or

perl -n0l012e 'print for m|<p>.*?</p>|gs'
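For readers less familiar with Perl switches: the second command reads the whole input at once and prints every non-greedy match of `<p>.*?</p>`, with `.` allowed to span newlines. A rough Python equivalent might look like the sketch below; the function name `find_bare_paragraphs` is hypothetical, and the same caveats about regex and HTML apply (for instance, an unclosed `<p>` would cause the match to run on to a later `</p>`).

```python
import re


def find_bare_paragraphs(html):
    # "<p>" is matched literally, so "<p class=...>" never starts a match.
    # ".*?" is non-greedy; re.DOTALL lets "." cross newlines, like Perl's /s.
    return re.findall(r"<p>.*?</p>", html, flags=re.DOTALL)


if __name__ == "__main__":
    import sys
    # Read the whole document from standard input, then print one match per line.
    for match in find_bare_paragraphs(sys.stdin.read()):
        print(match)
```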
