xml - How do I remove DIV tags with a style attribute from an XML file?

Question

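I have a huge WordPress XML export. Unfortunately, some jerk managed to inject code into the installation and inserted DIVs into the content. Now I want to clean up that mess. This is what it looks like: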

<p>Normal Text</p>
<div style="position:absolute;top:-9660px;left:-4170px;"><a href="http://insane.link.com">Insane Linktext</a></div>
<div style="position:absolute;top:-2460px;left:-5370px;"><a href="http://insane.link.com">Another Insane Linktext</a></div>
<p>Normal good people's brains' text</p>

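I thought about using a regular expression to match DIVs that contain a STYLE attribute. The tools I have available are Aptana or another text editor, a PHP server, and the OS X terminal. Any suggestions?

Thanks and cheers!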

score 2 · Accepted Answer

I'd suggest not using a regular expression but a real XML parser. For example, since you're on OS X, Ruby is already installed, and you can clean up the HTML with:

require 'nokogiri'                      # Use `sudo gem install nokogiri` first
html = Nokogiri.HTML(IO.read(ARGV[0]))  # read and parse the HTML document
html.css('div[style]').remove           # destroy all <div style="...">...</div>
File.open(ARGV[1],'w'){ |f| f << html } # write the html to disk as a new file

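As the comment on the first line says, you need to install Nokogiri first.

Then save the code above as "clean_divs.rb" and run ruby clean_divs.rb my.html my_fixed.html (the first argument is the name of the file to read, the second is the name of the file to write).

If you want to be more precise about what you destroy, you can select the elements with XPath instead. For example, html.xpath('//div[@style][a]').remove only finds divs that have both a style attribute and an <a> element as a direct child.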
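To try that XPath expression without installing anything, here is a minimal sketch that swaps Nokogiri for REXML, which ships with Ruby. This is not the Nokogiri call from the answer. The helper name remove_injected_divs, the SAMPLE constant, the <body> wrapper, and the second "styled div without a link" are all made up for illustration. REXML needs well-formed XML with a single root element, so it is less forgiving than Nokogiri's HTML parser.

```ruby
require 'rexml/document'

# Hypothetical helper: removes every <div> that has a style attribute
# AND an <a> element as a direct child, then returns the serialized XML.
def remove_injected_divs(xml)
  doc = REXML::Document.new(xml)
  # Collect the matches first, then detach each one from its parent.
  REXML::XPath.match(doc, '//div[@style][a]').each(&:remove)
  doc.to_s
end

# Illustrative input; the <body> wrapper is an assumption so that the
# snippet has a single root element.
SAMPLE = <<~'XML'
  <body>
  <p>Normal Text</p>
  <div style="position:absolute;top:-9660px;left:-4170px;"><a href="http://insane.link.com">Insane Linktext</a></div>
  <div style="color:red;">A styled div without a link</div>
  <p>More text</p>
  </body>
XML

puts remove_injected_divs(SAMPLE)
```

The styled div without a link survives because the [a] predicate requires an <a> child, which is the extra precision the XPath version buys you over div[style].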

score 0 · Accepted Answer

This may help you. It matches the divs you posted above:

<div style="[a-zA-Z0-9-:;]+"><a href="[a-z:/.]+">[a-zA-Z ]+</a></div>

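However, it only matches the div > a > text pattern, and only divs that have a style attribute and nothing else.

You should be able to run it as a find-and-replace in most HTML editors (Dreamweaver and Notepad++ both allow this).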
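Since Ruby is already on OS X, the same pattern can also be applied from the terminal instead of an editor. A rough sketch, assuming the injected markup always has exactly the shape shown in the question. The names INJECTED_DIV, strip_injected_divs and QUESTION_SAMPLE are made up for illustration, and the only change to the pattern is that the hyphen inside the first character class is escaped:

```ruby
# The pattern from this answer as a Ruby regex literal. %r{...} avoids
# having to escape the slashes in "</a></div>".
INJECTED_DIV = %r{<div style="[a-zA-Z0-9\-:;]+"><a href="[a-z:/.]+">[a-zA-Z ]+</a></div>}

# Hypothetical helper: deletes every match and leaves everything else alone.
def strip_injected_divs(text)
  text.gsub(INJECTED_DIV, '')
end

# The sample markup from the question.
QUESTION_SAMPLE = <<~'HTML'
  <p>Normal Text</p>
  <div style="position:absolute;top:-9660px;left:-4170px;"><a href="http://insane.link.com">Insane Linktext</a></div>
  <div style="position:absolute;top:-2460px;left:-5370px;"><a href="http://insane.link.com">Another Insane Linktext</a></div>
  <p>Normal good people's brains' text</p>
HTML

puts strip_injected_divs(QUESTION_SAMPLE)
```

The lines that held the divs are left behind as empty lines, and any injected div that deviates from this shape (extra attributes, digits in the link text, nested markup) is not matched at all.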

score 0 · Accepted Answer

You can remove these <div> elements with a modified identity transform that has an empty template for them:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

    <!--default processing for content is to copy forward -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!--remove the rogue div elements -->
    <xsl:template match="div[@style]" />

</xsl:stylesheet>
