xml - Batch processing tab-delimited files in XSLT

Question

I have an XML file containing a list of 92 tab-delimited text files:

<?xml version="1.0" encoding="UTF-8"?>
<dumpSet>
  <dump filename="file_one.txt"/>
  <dump filename="file_two.txt"/>
  <dump filename="file_three.txt"/>
  ...
</dumpSet>

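The first line of each file contains the field names for the rows that follow. This is just an example. The names and number of elements will vary by record. Most will have around 50 field names.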

Title   Translated Title    Watch Video Interviewee Interviewer Videographer
Interview with Barack Obama         Obama, Barack   Walters, Barbara
Interview with Sarah Palin          Palin, Sarah    Couric, Katie   Smith, John
...

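Oxygen XML Editor has an Import feature that converts text files to XML, but as far as I know it can't be run as a batch process over multiple files. So far the batch part hasn't been a problem: I'm using the XSLT 2.0 unparsed-text() function to pull in the content of the files in the list. However, I'm struggling to group the XML output correctly. An example of the desired output: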

<collection>
  <record>
    <title>Interview with Barack Obama</title>
    <translatedtitle></translatedtitle>
    <watchvideo></watchvideo>
    <interviewee>Obama, Barack</interviewee>
    <interviewer>Walters, Barbara</interviewer>
    <videographer>Smith, John</videographer>
  </record>
  <record>
    <title>Interview with Sarah Palin</title>
    <translatedtitle></translatedtitle>
    <watchvideo></watchvideo>
    <interviewee>Palin, Sarah</interviewee>
    <interviewer>Couric, Katie</interviewer>
    <videographer>Smith, John</videographer>
  </record>
  ...
</collection>

Right now, this is the output I get:

<collection>
  <record>
    <title>title</title>
    <value>Interview with Barack Obama</value>
    <value>Interview with Sarah Palin</value>
    <translatedtitle>translatedtitle</translatedtitle>
    <value/>
    <value/>
    <watchvideo>watchvideo</watchvideo>
    <value/>
    <value/>
    <interviewee>interviewee</interviewee>
    <value>Obama, Barack</value>
    <value>Palin, Sarah</value>
    <interviewer>interviewer</interviewer>
    <value>Walters, Barbara</value>
    <value>Couric, Katie</value>
    <videographer>videographer</videographer>
    <value>Smith, John</value>
    <value>Smith, John </value>
    <value/>
    <value/>
  </record>
</collection>

In other words, I can't get the output grouped by record. Here is the code I'm currently using, based on an example from Doug Tidwell's XSLT book:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all" version="2.0">

    <xsl:param name="i" select="1"/>
    <xsl:param name="increment" select="1"/>
    <xsl:param name="operator" select="'&lt;='"/>
    <xsl:param name="testVal" select="100"/>    

    <xsl:template match="/">
        <collections>
            <collection>
                <xsl:for-each select="dumpSet/dump">

                    <!-- Pull in external tab-delimited files -->  
                    <xsl:for-each select="unparsed-text(concat('../2013-04-26/',@filename),'UTF-8')">
                        <record>

                            <!-- Call recursive template to loop through elements. -->
                            <xsl:call-template name="for-loop">
                                <xsl:with-param name="i" select="$i"/>
                                <xsl:with-param name="increment" select="$increment"/>
                                <xsl:with-param name="operator" select="$operator"/>
                                <xsl:with-param name="testVal" select="$testVal"/>
                            </xsl:call-template>
                        </record>
                    </xsl:for-each>
                </xsl:for-each>
            </collection>
        </collections>
    </xsl:template>

    <xsl:template name="for-loop">
        <xsl:param name="i"/>
        <xsl:param name="increment"/>
        <xsl:param name="operator"/>
        <xsl:param name="testVal"/>
        <xsl:variable name="testPassed">
            <xsl:choose>
                <xsl:when test="$operator = '&lt;='">
                    <xsl:if test="$i &lt;= $testVal">
                        <xsl:text>true</xsl:text>
                    </xsl:if>
                </xsl:when>
            </xsl:choose>
        </xsl:variable>
        <xsl:if test="$testPassed = 'true'">

            <!-- Separate the header from the tab-delimited file. -->
            <xsl:for-each select="tokenize(.,'\r|\n')[1]">

                <!-- Spit out the field names. -->
                <xsl:for-each select="tokenize(.,'\t')[$i]">
                    <xsl:element name="{replace(lower-case(translate(.,'-.','')),' ','')}">
                        <xsl:value-of select="replace(lower-case(translate(.,'-.','')),' ','')"/>
                    </xsl:element>
                </xsl:for-each>
            </xsl:for-each>

            <!-- For the following rows, loop through the field values. -->
            <xsl:for-each select="tokenize(.,'\r|\n')[position()&gt;1]">
                <xsl:for-each select="tokenize(.,'\t')[$i]">
                    <value>
                        <xsl:value-of select="."/>
                    </value>
                </xsl:for-each>
            </xsl:for-each>

            <!-- Call the template to increment. -->  
            <xsl:call-template name="for-loop">
                <xsl:with-param name="i" select="$i + $increment"/>
                <xsl:with-param name="increment" select="$increment"/>
                <xsl:with-param name="operator" select="$operator"/>
                <xsl:with-param name="testVal" select="$testVal"/>
            </xsl:call-template>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

How should I change this so that the output is grouped by record?

score 0 · Accepted Answer

Try this XSLT to see how it can meet your needs. You'll need to add your translate() calls wherever you need them.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <collections>
      <collection>
        <xsl:for-each select="dumpSet/dump">
          <xsl:for-each select="tokenize(unparsed-text(@filename,'UTF-8'),'\n')[not(position()=1)]">
            <record>
              <title><xsl:value-of select="tokenize(.,'\t')[1]"/></title>
              <translatedtitle><xsl:value-of select="tokenize(.,'\t')[2]"/></translatedtitle>
              <watchvideo><xsl:value-of select="tokenize(.,'\t')[3]"/></watchvideo>
              <interviewee><xsl:value-of select="tokenize(.,'\t')[4]"/></interviewee>
              <interviewer><xsl:value-of select="tokenize(.,'\t')[5]"/></interviewer>
              <videographer><xsl:value-of select="tokenize(.,'\t')[6]"/></videographer>
            </record>
          </xsl:for-each>
        </xsl:for-each>
      </collection>
    </collections>
  </xsl:template>

</xsl:stylesheet>

Output:

<collections xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <collection>
      <record>
         <title>Interview with Barack Obama</title>
         <translatedtitle/>
         <watchvideo>Obama, Barack</watchvideo>
         <interviewee>Walters, Barbara</interviewee>
         <interviewer>&#xD;</interviewer>
         <videographer/>
      </record>
      <record>
         <title>Interview with Sarah Palin</title>
         <translatedtitle/>
         <watchvideo>Palin, Sarah</watchvideo>
         <interviewee>Couric, Katie</interviewee>
         <interviewer>Smith, John</interviewer>
         <videographer/>
      </record>
   </collection>
</collections>
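
The element names above are hard-coded, while the question says the field names differ from file to file. A minimal sketch of one way to generalize it, assuming the first line of every file is the header and that each cleaned-up header value is a valid XML element name (the variable names `lines`, `names`, `values`, `pos`, and `name` are illustrative only). It also splits on `\r?\n`, because splitting on `\n` alone leaves the carriage return of a Windows line ending in the last field, which is where the `&#xD;` in the output above comes from.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <collections>
      <collection>
        <xsl:for-each select="dumpSet/dump">
          <!-- Split on LF or CRLF so no stray carriage return ends up in a field. -->
          <xsl:variable name="lines"
            select="tokenize(unparsed-text(@filename,'UTF-8'),'\r?\n')"/>
          <!-- Field names come from the first line of each file. -->
          <xsl:variable name="names" select="tokenize($lines[1],'\t')"/>
          <!-- Skip the header row and any blank lines. -->
          <xsl:for-each select="$lines[position() gt 1][normalize-space()]">
            <xsl:variable name="values" select="tokenize(.,'\t')"/>
            <record>
              <!-- Walk the header by index so every record gets every field,
                   even when a row has fewer values than the header. -->
              <xsl:for-each select="1 to count($names)">
                <xsl:variable name="pos" select="."/>
                <!-- Same clean-up as in the question: drop '-' and '.',
                     lower-case, remove whitespace. -->
                <xsl:variable name="name"
                  select="replace(lower-case(translate($names[$pos],'-.','')),'\s','')"/>
                <!-- An empty header cell cannot become an element name. -->
                <xsl:if test="$name != ''">
                  <xsl:element name="{$name}">
                    <xsl:value-of select="$values[$pos]"/>
                  </xsl:element>
                </xsl:if>
              </xsl:for-each>
            </record>
          </xsl:for-each>
        </xsl:for-each>
      </collection>
    </collections>
  </xsl:template>

</xsl:stylesheet>
```

Because tokenize() keeps empty tokens between consecutive tabs, the position of a value in a row lines up with the position of its name in the header. A header value that is still not a valid element name after the clean-up (one containing an apostrophe, say) would raise an error in xsl:element.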

score 0 · Accepted Answer

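It may be easier if you parse each record with xsl:analyze-string. There's probably a better way to get the element names from the header than in my example, but I didn't have time to think about it for long.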

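Notes:

You may have to change the encoding passed to unparsed-text(). I usually pass the encoding in as a parameter so I don't have to modify the stylesheet. Maybe the encoding could be added to <dump/>?

It would be a good idea to use unparsed-text-available() to check that the file exists and can be read with the specified encoding.

Also, you may want to check that the values in the header are valid QNames. For example, if a header contains an apostrophe, you'll get an error. It might be better to use the field names from the header as attribute values instead of element names (for example, <field name="Interviewee">Obama, Barack</field>).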

Here's my example:

XML Input

<dumpSet>
  <dump filename="file_one.txt"/>
</dumpSet>

file_one.txt

Title   Translated Title    Watch Video Interviewee Interviewer Videographer
Interview with Barack Obama         Obama, Barack   Walters, Barbara
Interview with Sarah Palin          Palin, Sarah    Couric, Katie   Smith, John

XSLT 2.0

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="dumpSet">
        <collection>
            <xsl:apply-templates select="dump[@filename]"/>
        </collection>
    </xsl:template>

    <xsl:template match="dump">
        <xsl:variable name="text" select="unparsed-text(@filename, 'iso-8859-1')"/>
        <xsl:variable name="header">
            <xsl:analyze-string select="$text" regex="(..*)">
                <xsl:matching-substring>
                    <xsl:if test="position()=1">
                        <xsl:value-of select="regex-group(1)"/>
                    </xsl:if>                   
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <xsl:variable name="headerTokens" select="tokenize($header,'\t')"/>
        <xsl:analyze-string select="$text" regex="(..*)">
            <xsl:matching-substring>
                <xsl:if test="not(position()=1)">
                    <record>
                        <xsl:analyze-string select="." regex="([^\t][^\t]*)\t?|\t">
                            <xsl:matching-substring>
                                <xsl:variable name="pos" select="position()"/>
                                <xsl:element name="{replace(normalize-space(lower-case($headerTokens[$pos])),' ','')}">
                                    <xsl:value-of select="normalize-space(regex-group(1))"/>                            
                                </xsl:element>                              
                            </xsl:matching-substring>
                        </xsl:analyze-string>
                    </record>
                </xsl:if>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>

Output

<collection>
   <record>
      <title>Interview with Barack Obama</title>
      <translatedtitle/>
      <watchvideo/>
      <interviewee>Obama, Barack</interviewee>
      <interviewer>Walters, Barbara</interviewer>
   </record>
   <record>
      <title>Interview with Sarah Palin</title>
      <translatedtitle/>
      <watchvideo/>
      <interviewee>Palin, Sarah</interviewee>
      <interviewer>Couric, Katie</interviewer>
      <videographer>Smith, John</videographer>
   </record>
</collection>
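
The notes in this answer (encoding as a parameter or on <dump/>, an unparsed-text-available() check, and field names as attribute values) could be combined roughly as follows. This is only a sketch under stated assumptions: the `encoding` attribute on <dump/> and the `defaultEncoding` parameter are hypothetical names, and it uses tokenize() rather than the xsl:analyze-string approach above so that empty fields keep their column position.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>

    <!-- Fallback encoding; the parameter name is an assumption. -->
    <xsl:param name="defaultEncoding" select="'UTF-8'"/>

    <xsl:template match="dumpSet">
        <collection>
            <xsl:apply-templates select="dump[@filename]"/>
        </collection>
    </xsl:template>

    <xsl:template match="dump">
        <!-- Hypothetical @encoding on <dump/>; otherwise use the parameter. -->
        <xsl:variable name="enc" select="(@encoding, $defaultEncoding)[1]"/>
        <xsl:choose>
            <!-- Only read the file if it exists and decodes with $enc. -->
            <xsl:when test="unparsed-text-available(@filename, $enc)">
                <xsl:variable name="lines"
                    select="tokenize(unparsed-text(@filename, $enc), '\r?\n')"/>
                <xsl:variable name="names" select="tokenize($lines[1], '\t')"/>
                <!-- Skip the header row and blank lines. -->
                <xsl:for-each select="$lines[position() gt 1][normalize-space()]">
                    <xsl:variable name="values" select="tokenize(., '\t')"/>
                    <record>
                        <xsl:for-each select="1 to count($names)">
                            <xsl:variable name="pos" select="."/>
                            <!-- The header text goes into an attribute value, so it
                                 does not have to be a valid element name. -->
                            <field name="{normalize-space($names[$pos])}">
                                <xsl:value-of select="normalize-space($values[$pos])"/>
                            </field>
                        </xsl:for-each>
                    </record>
                </xsl:for-each>
            </xsl:when>
            <xsl:otherwise>
                <xsl:message>Skipping <xsl:value-of select="@filename"/>: not readable as <xsl:value-of select="$enc"/></xsl:message>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

</xsl:stylesheet>
```

With this shape a header such as one containing an apostrophe is harmless, at the cost of a less specific element vocabulary in the result. A file that is missing or unreadable is reported through xsl:message and skipped instead of aborting the whole batch.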
