python - Replacing Selections of HTML via the Command-Line

Question

EDIT: I know how to do this. I'm not looking for a solution, I'm looking for a process or existing program recommendation before I take the time to write something myself in some scripting language.

I have some HTML files in various directories which all have a similar structure:

<html>
    <head>...</head>
    <body>
        <nav>...</nav>
        <section>...</section>
    </body>
</html>

I'd like to programmatically replace HTML sections with other sections (e.g. replace the <nav> block with a different nav block [specified in a file of my choosing]) for all the files I specify.

I think the ideal solution would be some sort of tool using lxml or something similar in Python, but if there were an easy way to do it with *nixy tools, or an existing program to do this, I'd be happy to do that instead of putting together a script.

score 3 · Accepted Answer

You could probably use BeautifulSoup for Python, like this:

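from bs4 import BeautifulSoup  # BeautifulSoup 4; the original BeautifulSoup 3 module is no longer maintained

# htmldata is assumed to hold the HTML source as a string
soup = BeautifulSoup(htmldata, "html.parser")
nav = soup.find("nav")
nav.name = "new_name"  # placeholder: renames the tag and keeps its contents

For example:

from bs4 import BeautifulSoup

html_data = "<nav>Some text</nav>"
soup = BeautifulSoup(html_data, "html.parser")
nav = soup.find("nav")
nav.name = "nav2"  # renames <nav> to <nav2>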

This changes <nav></nav> to <nav2></nav2>.
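
Note that renaming the tag does not swap in new content, which is what the question asks for. As a rough sketch of replacing the whole element, here is one way to do it with Python's standard-library xml.etree.ElementTree instead of BeautifulSoup. It assumes the markup is well-formed XML (real-world HTML often is not), it only looks below the root element, and the function name replace_first is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def replace_first(document: str, tag: str, fragment: str) -> str:
    """Swap the first <tag> element below the root for the element in `fragment`."""
    root = ET.fromstring(document)        # raises ParseError on malformed markup
    new_element = ET.fromstring(fragment)
    for parent in root.iter():            # treat every element as a possible parent
        for index, child in enumerate(parent):
            if child.tag == tag:
                new_element.tail = child.tail   # keep whitespace that followed the old element
                parent[index] = new_element
                return ET.tostring(root, encoding="unicode", method="html")
    return document                       # no match: hand the input back unchanged


html = "<html><body><nav>old</nav><section>...</section></body></html>"
print(replace_first(html, "nav", '<nav class="site">new</nav>'))
```

In practice the fragment would be read from a file of your choosing rather than hard-coded.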

score 3 · Accepted Answer

Don't use regular expressions or string parsing. Those will only make your head hurt. Use a parser.

In Ruby, I'd use Nokogiri:

require 'nokogiri'

html = '
<html>
  <body>
    <nav>...</nav>
    <section>...</section>
  </body>
</html>
'
doc = Nokogiri::HTML(html)

nav = doc.at('nav').content = "this is a new block"
puts doc.to_html

Which outputs:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
    <nav>this is a new block</nav><section>...</section>
</body></html>

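Of course, you'd want to replace "this is a new block" with something like File.read('snippet.html').

If your replacement file contains the whole HTML snippet rather than just the nav content, use this instead: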

nav = doc.at('nav').replace('<nav>this is a new block</nav>')

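The output will be the same. (Again, use File.read to get it from a file if you're so inclined.)

In Nokogiri, at finds the first instance of the tag specified by a CSS or XPath accessor and returns the Node. I used CSS above, but //nav would have worked too. at guesses at the type of accessor. You can use at_css or at_xpath if you want to be specific, because it's possible to have ambiguous accessors. Nokogiri also has search, which returns a NodeSet that acts like an array, so you can iterate over the results and do whatever you want with them. Like at, it has CSS- and XPath-specific versions, css and xpath respectively.

Nokogiri has a CLI, and for something as simple as this example it would work, but I could also do it in sed or a Ruby/Perl/Python one-liner.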
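
For anyone following along in Python rather than Ruby, the standard library's xml.etree.ElementTree has a loosely similar split between first-match and all-matches lookups. This is only an analogy, not a port of Nokogiri: ElementTree supports a limited XPath subset and no CSS selectors, and the sketch assumes well-formed markup. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<nav>main menu</nav>"
    "<section><nav>section menu</nav></section>"
    "</body></html>"
)

# find() returns only the first match (or None), roughly like Nokogiri's `at`.
first = page.find(".//nav")
print(first.text)

# findall() returns a list of every match, roughly like Nokogiri's `search`.
every = page.findall(".//nav")
for nav in every:
    nav.text = "replaced"

print([nav.text for nav in every])
```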

curl -s http://nokogiri.org | nokogiri -e'p $_.css("h1").length'

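HTML is seldom this simple, though, especially anything roaming in the wild, and a CLI or one-liner solution will rapidly grow out of control, or simply die. I say that based on years of writing many spiders and RSS aggregators: what starts out simple gets a lot more complex when you introduce one additional HTML or XML source, and it never gets easier. Using parsers taught me to reach for them first.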

score 2 · Accepted Answer

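I ended up writing my own little command-line tool to do what I wanted. It works for my use case, and I plan to improve it over time. It's on GitHub: trufflepig.

I hope it can be useful to others as well.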
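
The source of trufflepig isn't reproduced here. Purely as an illustration of what a small tool of this kind could look like (this is not trufflepig's actual design), the sketch below uses Python's standard-library html.parser to find where the first matching element starts and ends, then splices the replacement in so the rest of each file stays byte-for-byte the same. BlockLocator, replace_block, main, and the argument names are all hypothetical, and it assumes UTF-8 files whose target element is properly closed.

```python
import argparse
from html.parser import HTMLParser
from pathlib import Path


class BlockLocator(HTMLParser):
    """Find the character span of the first <tag>...</tag> element in a string."""

    def __init__(self, tag, source):
        super().__init__()
        self.tag, self.source = tag, source
        self.depth, self.start, self.span = 0, None, None
        # Index at which each line begins; getpos() reports (line, column).
        self.line_starts = [0]
        for line in source.split("\n")[:-1]:
            self.line_starts.append(self.line_starts[-1] + len(line) + 1)

    def _offset(self):
        line, column = self.getpos()
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and self.span is None:
            if self.depth == 0:
                self.start = self._offset()
            self.depth += 1   # count nested elements of the same name

    def handle_endtag(self, tag):
        if tag == self.tag and self.span is None and self.depth:
            self.depth -= 1
            if self.depth == 0:
                end = self.source.index(">", self._offset()) + 1
                self.span = (self.start, end)


def replace_block(source, tag, replacement):
    """Return `source` with the first <tag> element replaced, or unchanged if absent."""
    locator = BlockLocator(tag.lower(), source)
    locator.feed(source)
    locator.close()
    if locator.span is None:
        return source
    start, end = locator.span
    return source[:start] + replacement + source[end:]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Replace one HTML block in many files.")
    parser.add_argument("tag", help="element name to replace, e.g. nav")
    parser.add_argument("snippet", type=Path, help="file holding the replacement block")
    parser.add_argument("files", nargs="+", type=Path, help="HTML files to rewrite in place")
    args = parser.parse_args(argv)

    replacement = args.snippet.read_bytes().decode("utf-8").strip()
    for path in args.files:
        original = path.read_bytes().decode("utf-8")   # bytes round-trip keeps line endings
        updated = replace_block(original, args.tag, replacement)
        if updated != original:
            path.write_bytes(updated.encode("utf-8"))
            print(f"updated {path}")
```

Saved as a script with main() called under the usual __name__ guard, it would be run with something like: python replace_block.py nav new_nav.html site/*.html (the script and file names here are placeholders). Files are rewritten in place, so try it on copies or a version-controlled tree first.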
