xml - Remove all HTML from XML

Question

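I am trying to feed some XML to Apache Solr, but some of the XML contains HTML formatting inside the text, and that prevents me from posting it to my Solr server. Obviously it would be nice to keep this information, since my documents could then be pre-formatted before posting. But I haven't seen, and don't know, whether escaping would avoid Solr's problem with HTML. My question is: how do I use XSLT to remove the HTML from the XML?

For example: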

What I have:

<field name="description"><h1>This is a description of a doc!</h1><p> This doc contains some information</p></field>

What I need:

<field name="description">This is a description of a doc! This doc contains some information.</field>

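I would like a smart fix rather than a blacklist of specific tags to scrub during the XSL transformation. A blacklist would be inefficient: if someone creates a new document with, say, a tag that is not on the list, the blacklist will miss it unless a programmer adds it manually.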

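I tried converting the HTML tags to HTML entities (&lt; and &gt; for < and > respectively), but that messes things up when I try to post the content through HttpPost using BasicNameValuePairs. I do not want to use these entities.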

Any ideas, StackOverflow?

score 2 · Accepted Answer

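If you know which elements contain the HTML, you can match any descendant of those elements and simply apply templates, which keeps the text and drops the markup around it.

Example...

XML Input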

<field name="description"><h1>This is a <b>description</b> of a doc!</h1><!--Here's a comment--><p> This doc contains some information</p></field>

XSLT 1.0

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

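    <!-- Identity template: copy every node and attribute unchanged. -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- Any non-text node below a field (element, comment, processing
         instruction) is not copied; only its children are processed, so
         the text nodes still reach the identity template. -->
    <xsl:template match="node()[ancestor::field and not(self::text())]">
        <xsl:apply-templates/>
    </xsl:template>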

</xsl:stylesheet>

XML Output

<field name="description">This is a description of a doc! This doc contains some information</field>
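
For anyone pre-processing the documents outside of XSLT, the same idea (keep the text of everything below a field, drop the markup) can be sketched in Python with the standard library. This is a different technique from the stylesheet above, not part of the original answer: the function name strip_markup and the wrapping doc element are made up for illustration, and it assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET


def strip_markup(xml_text, tag="field"):
    """Replace the content of every <tag> element with its plain text.

    Hypothetical helper for illustration; assumes well-formed XML input.
    """
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for elem in list(root.iter(tag)):
        # itertext() yields the text of the element and all its descendants
        # in document order; the default parser has already dropped comments.
        text = "".join(elem.itertext())
        # Remove the child elements but keep the element's own attributes.
        for child in list(elem):
            elem.remove(child)
        elem.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<doc><field name="description"><h1>This is a <b>description</b> '
        "of a doc!</h1><!--Here's a comment--><p> This doc contains some "
        "information</p></field></doc>"
    )
    print(strip_markup(sample))
```

Like the stylesheet, this needs no list of tag names, so new or unknown tags inside a field are stripped without any extra work.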
