
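I asked this question recently but realized I did not explain it very clearly. I have a large .csv file of invoices (more than 8,000 lines), with multiple lines per invoice. I parse it into an XML structure like the one shown below (simplified).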

Input 1 - $XMLInput

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
</root>

Input 2 - $maxBatchSize. Description: once a batch would grow larger than this size (a constant), break to the next batch.

Input 3 - $listOfInvoices. Description: a repeating variable holding the unique invoice numbers in the document. Example:

<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
    </row>
</root>

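To improve processing time, I need to group these elements by invoiceNumber into batches of no more than X nodes each (X is a variable passed in). From there I will send each batch to a sub-processor in parallel rather than processing the whole original document at once. For example, for the sample XML document above, if the batch size cannot be larger than 3, I would need the following XML output: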

Output 1 - $XMLOutput

<root>
    <batch>
        <row>
            <invoiceNumber>1</invoiceNumber>
            <invoiceText>invoice 1-1</invoiceText>
            <position>1</position>
            ...
        </row>
        <row>
            <invoiceNumber>1</invoiceNumber>
            <invoiceText>invoice 1-2</invoiceText>
            <position>2</position>
            ...
        </row>
        <row>
            <invoiceNumber>2</invoiceNumber>
            <invoiceText>invoice 2-1</invoiceText>
            <position>3</position>
            ...
        </row>
        <row>
            <invoiceNumber>2</invoiceNumber>
            <invoiceText>invoice 2-2</invoiceText>
            <position>4</position>
            ...
        </row>
    </batch>
    <batch>
        <row>
            <invoiceNumber>3</invoiceNumber>
            <invoiceText>invoice 3-1</invoiceText>
            <position>5</position>
            ...
        </row>
        <row>
            <invoiceNumber>3</invoiceNumber>
            <invoiceText>invoice 3-2</invoiceText>
            <position>6</position>
            ...
        </row>
    </batch>
</root>

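All lines of an invoice must be sent in the same batch. My initial XSLT (2.0) attempt is below. I tried to simulate a while loop by recursively calling a template that appends invoice groups to the current node. When the maximum batch size is reached, I recursively call the batch template to create a new batch. I pass the invoice and batch counters between the recursive calls.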

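Edit: Thanks to Ken's help, I am getting closer. I do need to break up the invoices by number of lines each time rather than by number of distinct invoices. In theory the following would work, but I am not sure how to ensure that an invoice number does not already exist in a preceding sibling.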

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:bpws="http://schemas.xmlsoap.org/ws/2003/03/business-process/" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <xsl:variable name="batch-size" select="40" as="xs:integer"/>
<xsl:variable name="input" select="bpws:getVariableData('sortedInvoicesByBU')"/>
<xsl:key name="invoice-lines-by-invoice-number" match="row" use="invoiceNumber4z"/>

<xsl:template match="/">
    <xsl:element name="batches">
        <!--establish batches from possible non-contiguous invoice numbers-->
        <xsl:for-each-group select="$input/*:UPSData/*:row" group-by="(position() - 1) idiv $batch-size">
            <xsl:for-each select="distinct-values($input/*:UPSData/*:row/*:invoiceNumber4z)[not(.=preceding-sibling::item)]">
                <xsl:element name="UPSData">
                    <xsl:for-each select="current()">
                        <xsl:for-each select="key('invoice-lines-by-invoice-number',.,$input)">
                            <!--copy rows as they are-->
                            <xsl:copy-of select="."/>
                        </xsl:for-each>
                    </xsl:for-each>
                </xsl:element>
            </xsl:for-each>
        </xsl:for-each-group>
    </xsl:element>
</xsl:template>
</xsl:stylesheet>

2 Answers


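I tell my students that you can torture a stylesheet as much as you like until it finally works, but that does not make it maintainable, or even the right way of doing things. I hope you can accept the analysis that you are treating XSLT as an imperative programming language, which does the language an injustice and only convinces you that it is difficult, verbose, and awkward for things that are easier in C and Java.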

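But if you use XSLT the way it was designed, it is easier than an imperative language and, to boot, it is based entirely on XML, in which you can show the result you want. Because it is shorter, it is easier to maintain. Once you understand the declarative instructions being used, you do not have to untangle an imperative algorithm. An XSLT processor can optimize a declarative approach, whereas if it has to follow an imperative approach as written, with no opportunity to optimize it, it is forced to work slowly.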

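In the solution below, which produces exactly your Output1 result, note how I determine the unique invoice numbers and then filter them down to the valid ones. I then batch them according to the batch size (which is a parameter). No template calls, no counters of any kind... just a solution using the built-in facilities of XSLT 2.0.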

And, not counting the declarations of the global parameter and variable or the comments, it is only 5 elements long: <root>, <xsl:for-each-group>, <batch>, <xsl:for-each>, and <xsl:copy-of>.

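As for why your approach does not work, I do not know... the approach you took does not "feel" like XSLT... it feels like an XSLT expression of some imperative programming approach.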

t:\ftemp>type numbers.xml 
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
    </row>
</root>

t:\ftemp>type invoices.xml 
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
</root>

t:\ftemp>call xslt2 invoices.xml invoices.xsl 
<?xml version="1.0" encoding="UTF-8"?>
<root>
   <batch>
      <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
      <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
   </batch>
   <batch>
      <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
      <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
   </batch>
</root>

t:\ftemp>type invoices.xsl 
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="2"/>

<xsl:variable name="valid-numbers"
              select="doc('numbers.xml')/root/row/invoiceNumber"/>

<xsl:template match="/">
  <xsl:variable name="invoiceLines" select="root/row"/>
  <root>
    <!--establish batches from possible non-contiguous invoice numbers-->
    <xsl:for-each-group  group-by="(position() - 1) idiv $batch-size" 
      select="distinct-values($invoiceLines/invoiceNumber)[.=$valid-numbers]">
      <!--create a batch using all invoice lines for all numbers in group-->
      <batch>
        <xsl:for-each select="$invoiceLines[invoiceNumber=current-group()]">
          <!--copy rows as they are-->
          <xsl:copy-of select="."/>
        </xsl:for-each>
      </batch>
    </xsl:for-each-group>
  </root>
</xsl:template>

</xsl:stylesheet>
t:\ftemp>rem Done! 
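
For readers unfamiliar with the grouping key used in the stylesheet above, here is a minimal standalone sketch of how (position() - 1) idiv $batch-size partitions a sequence. The sequence of numbers and the key attribute are hypothetical, chosen only for illustration; they are not part of the answer's stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="2"/>

<!--hypothetical values standing in for the distinct invoice numbers-->
<xsl:variable name="numbers" select="(10, 20, 30, 40, 50)"/>

<xsl:template match="/">
  <batches>
    <!--position() is the item's position in the selected sequence, so
        items 1-2 get key 0, items 3-4 get key 1, and item 5 gets key 2-->
    <xsl:for-each-group select="$numbers"
                        group-by="(position() - 1) idiv $batch-size">
      <batch key="{current-grouping-key()}">
        <xsl:value-of select="current-group()"/>
      </batch>
    </xsl:for-each-group>
  </batches>
</xsl:template>

</xsl:stylesheet>
```

Run against any well-formed input document, this should produce three batch elements holding 10 20, 30 40, and 50. Note that the key counts items in the grouped sequence (here, invoice numbers), not invoice lines.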

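I am editing this answer to add the alternative below: since you state you have 8 million input records, I think a key lookup table will perform better than my simple variable predicate. It produces the same result using one additional XSLT instruction in the template (it could be done without adding it, but I find this more readable) and drops the variable that is no longer needed.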

<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="2"/>

<xsl:variable name="valid-numbers"
              select="doc('numbers.xml')/root/row/invoiceNumber"/>

<xsl:key name="invoice-lines-by-invoice-number"
         match="row" use="invoiceNumber"/>

<xsl:variable name="input" select="/"/>

<xsl:template match="/">
  <root>
    <!--establish batches from possible non-contiguous invoice numbers-->
    <xsl:for-each-group  group-by="(position() - 1) idiv $batch-size" 
      select="distinct-values(root/row/invoiceNumber)[.=$valid-numbers]">
      <!--create a batch using all invoice lines for all numbers in group-->
      <batch>
        <xsl:for-each select="current-group()">
          <xsl:for-each
                     select="key('invoice-lines-by-invoice-number',.,$input)">
            <!--copy rows as they are-->
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </xsl:for-each>
      </batch>
    </xsl:for-each-group>
  </root>
</xsl:template>

</xsl:stylesheet>
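
As a quick sanity check on the claim that the key lookup selects the same rows as the variable predicate, the sketch below counts the rows found each way for every invoice number. It assumes the same invoices.xml structure shown above; the lookups and invoice output elements and their attribute names are hypothetical, used only for this comparison.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:key name="invoice-lines-by-invoice-number"
         match="row" use="invoiceNumber"/>

<xsl:variable name="input" select="/"/>

<xsl:template match="/">
  <lookups>
    <xsl:for-each select="distinct-values(root/row/invoiceNumber)">
      <xsl:variable name="num" select="."/>
      <!--by-predicate filters every row for each number;
          by-key asks the processor's key table for the matching rows-->
      <invoice number="{$num}"
        by-predicate="{count($input/root/row[invoiceNumber = $num])}"
        by-key="{count(key('invoice-lines-by-invoice-number',
                           $num, $input))}"/>
    </xsl:for-each>
  </lookups>
</xsl:template>

</xsl:stylesheet>
```

The two counts should agree for each invoice number. The third argument to key() is needed because the context item inside the loop is an atomic value rather than a node. Whether the key form is actually faster depends on the processor and the data, so it is worth measuring on your real input.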
answered 2013-08-23T01:29:37.480

Please do not mark this as the answer, because my earlier answer addresses the original question.

The code below answers the secondary question of how to batch by total invoice line count without splitting an invoice across two batches.

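I could not think of a way to do this declaratively, so the answer below is an imperative recursive solution, but it is written so that an XSLT processor that implements tail recursion will not consume stack space. I also take advantage of native XSLT features (key tables and sequences) that are awkward to mimic in other languages.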

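The code is very compact, with only one place that actually writes out a batch of invoices... there is no second block of batch-writing code. I am pleased with how it turned out.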

I welcome any suggestions for improvement, or posts of alternative solutions that are tighter than this.

t:\ftemp>type numbers.xml 
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>5</invoiceNumber>
    </row>
</root>

t:\ftemp>type invoices.xml 
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-1</invoiceText>
        <position>7</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-2</invoiceText>
        <position>8</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-3</invoiceText>
        <position>9</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-4</invoiceText>
        <position>10</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-5</invoiceText>
        <position>11</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-6</invoiceText>
        <position>12</position>
        ...
    </row>
    <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-1</invoiceText>
        <position>13</position>
        ...
    </row>
    <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-2</invoiceText>
        <position>14</position>
        ...
    </row>
</root>

t:\ftemp>call xslt2 invoices.xml invoices.xsl 
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <!--Batch max lines: 5-->
  <batch>
    <!--invoice numbers: 1 2-->
    <!--total line count: 4-->
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
      <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
   </batch>
   <batch>
    <!--invoice numbers: 3-->
    <!--total line count: 2-->
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
      <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
   </batch>
   <batch>
    <!--invoice numbers: 4-->
    <!--total line count: 6-->
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-1</invoiceText>
        <position>7</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-2</invoiceText>
        <position>8</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-3</invoiceText>
        <position>9</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-4</invoiceText>
        <position>10</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-5</invoiceText>
        <position>11</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-6</invoiceText>
        <position>12</position>
        ...
    </row>
   </batch>
   <batch>
    <!--invoice numbers: 5-->
    <!--total line count: 2-->
    <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-1</invoiceText>
        <position>13</position>
        ...
    </row>
      <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-2</invoiceText>
        <position>14</position>
        ...
    </row>
   </batch>
</root>

t:\ftemp>type invoices.xsl 
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="5"/>

<xsl:variable name="valid-numbers"
              select="doc('numbers.xml')/root/row/invoiceNumber"/>

<xsl:key name="invoice-lines-by-invoice-number"
         match="row" use="invoiceNumber"/>

<xsl:variable name="input" select="/"/>

<xsl:template match="/">
  <root>
    <xsl:text>&#xa;  </xsl:text>
    <xsl:comment select="'Batch max lines:',$batch-size"/>
    <xsl:text>&#xa;  </xsl:text>
    <xsl:call-template name="next-batch">
      <xsl:with-param name="remaining-numbers" 
        select="distinct-values(root/row/invoiceNumber)[.=$valid-numbers]"/>
    </xsl:call-template>
  </root>
</xsl:template>

<xsl:template name="next-batch">
  <xsl:param name="this-batch-lines" select="0"/>
  <xsl:param name="this-batch-numbers" select="()"/>
  <xsl:param name="remaining-numbers" required="yes"/>
  <xsl:variable name="this-invoice" select="$remaining-numbers[1]"/>
  <xsl:variable name="this-invoice-lines"
  select="count(key('invoice-lines-by-invoice-number',$this-invoice,$input))"/>

  <xsl:choose>
    <xsl:when test="not($this-invoice) and not($this-batch-lines)">
      <!--nothing to clean up and nothing more to do-->
    </xsl:when>
    <xsl:when test="not($this-invoice) (:last invoice complete:) or
                    ( $this-batch-lines + $this-invoice-lines > $batch-size )
                      (:this invoice exceeds limit:)">
      <!--clean up previous unfinished batch-->
      <batch>
        <xsl:text>&#xa;    </xsl:text>
        <xsl:comment select="'invoice numbers:',$this-batch-numbers"/>
        <xsl:text>&#xa;    </xsl:text>
        <xsl:comment select="'total line count:',$this-batch-lines"/>
        <xsl:text>&#xa;    </xsl:text>
        <xsl:copy-of select="for $num in $this-batch-numbers return
                         key('invoice-lines-by-invoice-number',$num,$input)"/>
      </batch>
      <xsl:if test="$this-invoice">
        <!--continue with the next batch comprised of this invoice only-->
        <xsl:call-template name="next-batch">
          <xsl:with-param name="this-batch-lines"
                          select="$this-invoice-lines"/>
          <xsl:with-param name="this-batch-numbers"
                          select="$this-invoice"/>
          <xsl:with-param name="remaining-numbers" 
                          select="$remaining-numbers[position()>1]"/>
        </xsl:call-template>
      </xsl:if>
      <!--the cleaned up batch was the last batch, template recursion ends-->
    </xsl:when>
    <xsl:otherwise>
      <!--a batch limit has not been exceeded; add this invoice to batch-->
      <xsl:call-template name="next-batch">
        <xsl:with-param name="this-batch-lines"
                        select="$this-batch-lines + $this-invoice-lines"/>
        <xsl:with-param name="this-batch-numbers"
                        select="($this-batch-numbers,$this-invoice)"/>
        <xsl:with-param name="remaining-numbers"
                          select="$remaining-numbers[position()>1]"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

</xsl:stylesheet>
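
To make the accumulator pattern in next-batch easier to follow in isolation, here is a reduced sketch that packs a plain sequence of line counts instead of invoice rows. The template name pack, the parameter names, and the line counts are hypothetical. One deliberate difference, which is my own assumption about desired behavior: the exists($batch) guard keeps an oversized first item from flushing an empty batch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="5"/>

<xsl:template match="/">
  <root>
    <xsl:call-template name="pack">
      <!--hypothetical line counts for five invoices-->
      <xsl:with-param name="remaining" select="(2, 2, 2, 6, 2)"/>
    </xsl:call-template>
  </root>
</xsl:template>

<xsl:template name="pack">
  <xsl:param name="batch" select="()"/>
  <xsl:param name="remaining" required="yes"/>
  <xsl:variable name="next" select="$remaining[1]"/>
  <xsl:choose>
    <xsl:when test="empty($next) and empty($batch)">
      <!--nothing pending and nothing left to process-->
    </xsl:when>
    <xsl:when test="empty($next) or
                    (exists($batch) and
                     sum($batch) + $next > $batch-size)">
      <!--flush the pending batch; this is the only place output is written-->
      <batch total="{sum($batch)}">
        <xsl:value-of select="$batch"/>
      </batch>
      <xsl:if test="exists($next)">
        <!--start the next batch with the item that did not fit-->
        <xsl:call-template name="pack">
          <xsl:with-param name="batch" select="$next"/>
          <xsl:with-param name="remaining" select="remove($remaining, 1)"/>
        </xsl:call-template>
      </xsl:if>
    </xsl:when>
    <xsl:otherwise>
      <!--the item fits; add it to the pending batch and continue-->
      <xsl:call-template name="pack">
        <xsl:with-param name="batch" select="($batch, $next)"/>
        <xsl:with-param name="remaining" select="remove($remaining, 1)"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

</xsl:stylesheet>
```

With a batch size of 5 these counts should pack into batches totalling 4, 2, 6, and 2 lines, mirroring the line counts in the transcript above. An item larger than the limit still gets a batch of its own, since an invoice cannot be split.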
answered 2013-09-02T01:27:57.853