html - Copying only the HTML from mixed XML and HTML

Question

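We have a bunch of files that are HTML pages but also contain additional XML elements (all prefixed with our company name, "TLA"). These elements supply data and structure to a legacy program that I am now rewriting.

Example form: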

<html >
<head>
    <title>Highly Simplified Example Form</title>
</head>
<body>
    <TLA:document xmlns:TLA="http://www.tla.com">
        <TLA:contexts>
            <TLA:context id="id_1" value=""></TLA:context>
        </TLA:contexts>
        <TLA:page>
            <TLA:question id="q_id_1">
                <table>
                    <tr>
                        <td>
                            <input id="input_id_1" type="text" />
                        </td>
                    </tr>
                </table>
            </TLA:question>
        </TLA:page>
        <!-- Repeat many times -->
    </TLA:document>
</body>
</html>

I have been tasked with writing a preprocessor that copies only the HTML elements, with their attributes and content intact, into a new file.

Like this:

<html >
<head>
    <title>Highly Simplified Example Form</title>
</head>
<body>
    <table>
        <tr>
            <td>
                <input id="input_id_1" type="text" />
            </td>
        </tr>
    </table>
    <!-- Repeat many times -->
</body>
</html>

I went with XSLT because that is what I needed in order to extract the TLA elements for a different file. This is the XSLT I have so far:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
    xmlns:mbl="http://www.tla.com">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*" />
  <xsl:template match="mbl:* | mbl:*/@* | mbl:*/text()"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>    
</xsl:stylesheet>

However, this only produces the following:

<html >
<head>
    <title>Highly Simplified Example Form</title>
</head>
<body>
</body>
</html>

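As you can see, everything inside the TLA:document element has been excluded. What needs to change in the XSLT so that I get all of the HTML but filter out the TLA elements?

Alternatively, is there a simpler way to go about this? I know that pretty much every browser will ignore the TLA elements, so is there a way to get what I need using an HTML tool or application?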

score 1 · Accepted Answer

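Your empty template for the TLA elements stops processing at TLA:document, so nothing beneath it is ever visited. Targeting HTML elements specifically would be difficult, but if you just want to exclude content from the TLA namespace (while still including any non-TLA elements that the TLA elements contain), then this should work: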

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mbl="http://www.tla.com" exclude-result-prefixes="mbl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- This element-only identity template prevents the 
       TLA namespace declaration from being copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Pass processing on to child elements of TLA elements -->  
  <xsl:template match="mbl:*">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>

If you want to exclude anything that is in any non-empty namespace, you can use this instead:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mbl="http://www.tla.com" exclude-result-prefixes="mbl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <xsl:template match="*[namespace-uri()]">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>
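
As an aside on the `*[namespace-uri()]` predicate: it is true whenever `namespace-uri()` returns a non-empty string, that is, for every element that belongs to some namespace. The same test can be sketched outside XSLT. The snippet below is only an illustration in Python's standard `xml.etree.ElementTree`, which stores qualified names as `{uri}local`; the helper name `namespace_uri` and the tiny sample document are made up for this example.

```python
import xml.etree.ElementTree as ET


def namespace_uri(elem):
    """Return the element's namespace URI, or "" when it has none.

    Hypothetical helper: ElementTree encodes namespaced tags as
    "{uri}local", so the URI is whatever sits between the braces.
    """
    tag = elem.tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


# Made-up miniature of the form: one namespaced wrapper around a table.
doc = ET.fromstring(
    '<body><x:page xmlns:x="http://www.tla.com"><table/></x:page></body>'
)
page = doc[0]
table = page[0]

# A non-empty URI is truthy, mirroring the XPath predicate.
print(bool(namespace_uri(page)), bool(namespace_uri(table)))  # True False
```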

Running either stylesheet on your sample input produces:

<html>
  <head>
    <title>Highly Simplified Example Form</title>
  </head>
  <body>
    <table>
      <tr>
        <td>
          <input id="input_id_1" type="text" />
        </td>
      </tr>
    </table>
  </body>
</html>
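
On the question of a simpler route without XSLT: if the files are well-formed XML, as the sample is, the same unwrapping can probably be done with a short script. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, not a drop-in replacement for the stylesheets. The function names `clean` and `unwrap` and the inlined sample are assumptions made for illustration. Like `select="*"` in the stylesheets, it keeps only the child elements of TLA wrappers; comments are dropped by the default parser, and text sitting directly next to a removed wrapper is not preserved.

```python
import xml.etree.ElementTree as ET

TLA_NS = "http://www.tla.com"  # namespace URI taken from the example form


def clean(text):
    # Rough equivalent of <xsl:strip-space elements="*"/>:
    # whitespace-only text is discarded.
    return text if text and text.strip() else None


def unwrap(elem, ns_uri):
    """Return the list of elements that replaces `elem` in the output."""
    children = [kept for child in elem for kept in unwrap(child, ns_uri)]
    if elem.tag.startswith("{" + ns_uri + "}"):
        # Wrapper element: drop it and any text between its children,
        # and hand the surviving child elements up to the parent.
        for child in children:
            child.tail = None
        return children
    # Ordinary element: copy tag, attributes and text, then attach children.
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text = clean(elem.text)
    copy.tail = clean(elem.tail)
    copy.extend(children)
    return [copy]


SAMPLE = """<html>
<head><title>Highly Simplified Example Form</title></head>
<body>
  <TLA:document xmlns:TLA="http://www.tla.com">
    <TLA:contexts><TLA:context id="id_1" value=""></TLA:context></TLA:contexts>
    <TLA:page>
      <TLA:question id="q_id_1">
        <table><tr><td><input id="input_id_1" type="text" /></td></tr></table>
      </TLA:question>
    </TLA:page>
  </TLA:document>
</body>
</html>"""

root = ET.fromstring(SAMPLE)
(result,) = unwrap(root, TLA_NS)  # the root is <html>, so exactly one element
ET.indent(result)  # pretty-printing helper, requires Python 3.9+
print(ET.tostring(result, encoding="unicode"))
```

Real-world HTML that is not well-formed XML would fail to parse here, in which case an HTML-aware parser would be needed instead.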
