
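I'm using XSLT (XSLT 2.0 is fine) to transform XML (TEI) into readable plain text (with a few minor modifications/challenges: preserving spacing for poetry, making headings all caps).

So far everything works the way I want, but for ease of reading I'd also like to limit the length of each line of text output by this transformation to a certain value (say, 80 characters wide), breaking only on whitespace (no splitting of words, etc.). I want to cap the length of every output line (at, say, 80 characters), not merely output the first 80 characters.

Does anyone have suggestions on the best approach? Is a template that matches all text() nodes and then uses XSLT's built-in string functions the way to go? I've tried to imagine doing this with the string functions (string-length, substring, or similar), but haven't had any luck yet.

(I could easily do this separately with a Python script, so perhaps "do it afterwards" is the best answer. I'm curious whether I'm overlooking a simple solution.)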
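For the "do it afterwards" route mentioned in the question, a minimal sketch with Python's standard textwrap module might look like the following. The function name wrap_text and the convention that paragraphs are separated by blank lines are assumptions made for illustration. It reflows whitespace inside each paragraph, so spacing that must be preserved (poetry, for instance) would need separate handling.

```python
import textwrap


def wrap_text(text, width=80):
    """Wrap each blank-line-separated paragraph to at most `width` columns.

    Words are never split: a word longer than `width` is placed on a line of
    its own (break_long_words=False), and hyphenated words are kept intact.
    """
    paragraphs = text.split("\n\n")  # assumption: blank line = paragraph break
    wrapped = (
        textwrap.fill(
            paragraph,
            width=width,
            break_long_words=False,
            break_on_hyphens=False,
        )
        for paragraph in paragraphs
    )
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    print(wrap_text("aaa bbb ccc ddd", width=7))
```

Running the plain-text output of the transformation through such a function keeps the stylesheet itself free of line-length logic.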


1 Answer


I. Here is a solution I wrote more than ten years ago.

This transformation (from the FXSL library):

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 xmlns:str-split2lines-func="f:str-split2lines-func"
 exclude-result-prefixes="f str-split2lines-func">

   <xsl:import href="str-foldl.xsl"/>
   <xsl:output method="text"/>

   <str-split2lines-func:str-split2lines-func/>

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters" select="' &#9;&#10;&#13;'"/>
      </xsl:call-template>
    </xsl:template>

    <xsl:template name="str-split-to-lines">
      <xsl:param name="pStr"/>
      <xsl:param name="pLineLength" select="60"/>
      <xsl:param name="pDelimiters" select="' &#9;&#10;&#13;'"/>

      <xsl:variable name="vsplit2linesFun"
                    select="document('')/*/str-split2lines-func:*[1]"/>

      <xsl:variable name="vrtfParams">
       <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
       <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
      </xsl:variable>

      <xsl:variable name="vResult">
          <xsl:call-template name="str-foldl">
            <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
            <xsl:with-param name="pStr" select="$pStr"/>
            <xsl:with-param name="pA0" select="$vrtfParams"/>
          </xsl:call-template>
      </xsl:variable>

      <xsl:for-each select="$vResult/line">
        <xsl:for-each select="word">
          <xsl:value-of select="concat(., ' ')"/>
        </xsl:for-each>
        <xsl:value-of select="'&#10;'"/>
      </xsl:for-each>
    </xsl:template>

    <xsl:template match="str-split2lines-func:*" mode="f:FXSL">
      <xsl:param name="arg1" select="/.."/>
      <xsl:param name="arg2"/>

      <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
      <xsl:copy-of select="$arg1/line[position() != last()]"/>

      <xsl:choose>
        <xsl:when test="contains($arg1/*[1], $arg2)">
          <xsl:if test="string($arg1/word)">
             <xsl:call-template name="fillLine">
               <xsl:with-param name="pLine" select="$arg1/line[last()]"/>
               <xsl:with-param name="pWord" select="$arg1/word"/>
               <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
             </xsl:call-template>
          </xsl:if>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$arg1/line[last()]"/>
          <word><xsl:value-of select="concat($arg1/word, $arg2)"/></word>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

      <!-- Test if the new word fits into the last line -->
    <xsl:template name="fillLine">
      <xsl:param name="pLine" select="/.."/>
      <xsl:param name="pWord" select="/.."/>
      <xsl:param name="pLineLength" />

      <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
      <xsl:variable name="vLineLength" select="string-length($pLine) + $vnWordsInLine"/>
      <xsl:choose>
        <xsl:when test="not($vLineLength + string-length($pWord) > $pLineLength)">
          <line>
            <xsl:copy-of select="$pLine/*"/>
            <xsl:copy-of select="$pWord"/>
          </line>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$pLine"/>
          <line>
            <xsl:copy-of select="$pWord"/>
          </line>
          <word/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

</xsl:stylesheet>

when applied to the following XML document:

<text>
Dec. 13 — As always for a presidential inaugural, security and surveillance were
extremely tight in Washington, DC, last January. But as George W. Bush prepared to
take the oath of office, security planners installed an extra layer of protection: a
prototype software system to detect a biological attack. The U.S. Department of
Defense, together with regional health and emergency-planning agencies, distributed
a special patient-query sheet to military clinics, civilian hospitals and even aid
stations along the parade route and at the inaugural balls. Software quickly
analyzed complaints of seven key symptoms — from rashes to sore throats — for
patterns that might indicate the early stages of a bio-attack. There was a brief
scare: the system noticed a surge in flulike symptoms at military clinics.
Thankfully, tests confirmed it was just that — the flu.
</text>

reflows the text to fit lines of at most 64 characters (any length can be specified as the value of the $pLineLength parameter) and produces:

Dec. 13 — As always for a presidential inaugural, security and 
surveillance were extremely tight in Washington, DC, last 
January. But as George W. Bush prepared to take the oath of 
office, security planners installed an extra layer of 
protection: a prototype software system to detect a biological 
attack. The U.S. Department of Defense, together with regional 
health and emergency-planning agencies, distributed a special 
patient-query sheet to military clinics, civilian hospitals and 
even aid stations along the parade route and at the inaugural 
balls. Software quickly analyzed complaints of seven key 
symptoms — from rashes to sore throats — for patterns that might 
indicate the early stages of a bio-attack. There was a brief 
scare: the system noticed a surge in flulike symptoms at 
military clinics. Thankfully, tests confirmed it was just that — 
the flu. 

The separate stylesheet imported by the transformation above is:

str-foldl.xsl:


<xsl:stylesheet version="2.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 exclude-result-prefixes="f">
    <xsl:template name="str-foldl">
      <xsl:param name="pFunc" select="/.."/>
      <xsl:param name="pA0"/>
      <xsl:param name="pStr"/>

      <xsl:choose>
         <xsl:when test="not(string($pStr))">
            <xsl:copy-of select="$pA0"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="vFunResult">
              <xsl:apply-templates select="$pFunc[1]" mode="f:FXSL">
                <xsl:with-param name="arg0" select="$pFunc[position() > 1]"/>
                <xsl:with-param name="arg1" select="$pA0"/>
                <xsl:with-param name="arg2" select="substring($pStr,1,1)"/>
              </xsl:apply-templates>
            </xsl:variable>

            <xsl:call-template name="str-foldl">
                    <xsl:with-param name="pFunc" select="$pFunc"/>
                    <xsl:with-param name="pStr" 
                   select="substring($pStr,2)"/>
                    <xsl:with-param name="pA0" select="$vFunResult"/>
            </xsl:call-template>
         </xsl:otherwise>
      </xsl:choose>

    </xsl:template>
</xsl:stylesheet>

Note that this is essentially an XSLT 1.0 solution. A much shorter XSLT 2.0 solution is possible using the regular-expression capabilities of XSLT 2.0.
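The idea behind the FXSL-based transformation is a left fold over the characters of the string: the accumulator holds the finished lines plus the word being built, and every delimiter character triggers a "does this word still fit on the last line?" test. As a rough illustration only, the same state machine can be sketched in Python. The names step, fill_line and split_to_lines are hypothetical, and unlike the stylesheet this sketch also flushes a final word that is not followed by a delimiter.

```python
from functools import reduce

DELIMITERS = " \t\n\r"  # same delimiter set as $pDelimiters in the stylesheet


def fill_line(lines, word, max_len):
    """Append `word` to the last line if it fits, otherwise start a new line."""
    if lines:
        last = lines[-1]
        # Length already used: all words plus one separating space per word.
        used = sum(len(w) for w in last) + len(last)
        if used + len(word) <= max_len:
            return lines[:-1] + [last + [word]]
    return lines + [[word]]


def step(state, ch, max_len):
    """One fold step: consume a single character and return the new state."""
    lines, word = state
    if ch not in DELIMITERS:
        return lines, word + ch      # keep growing the current word
    if not word:
        return lines, word           # consecutive delimiters: nothing to do
    return fill_line(lines, word, max_len), ""


def split_to_lines(text, max_len=64):
    """Fold `step` over the characters of `text` and join each line's words."""
    lines, word = reduce(lambda s, c: step(s, c, max_len), text, ([], ""))
    if word:                         # assumption: flush a trailing word
        lines = fill_line(lines, word, max_len)
    return [" ".join(words) for words in lines]


if __name__ == "__main__":
    for line in split_to_lines("aa bb cc dd", max_len=5):
        print(line)
```

Because the fold visits one character at a time, the work per character is small, but the recursion depth in the XSLT version grows with the length of the string, which is one reason the regex-based approach is more compact.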


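II. Using XSLT 2.0 regular expressions

Here is a function, f:getLine(), which, when passed a string and a maximum line length, returns the first line of that string: the longest starting substring (of the first chunk of maximum line length) that ends at a word boundary. The transformation below uses this function to produce the first line of the wanted multi-line result.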

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()">
    <xsl:sequence select="f:getLine(., 64)"/>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
      <xsl:otherwise>
            <xsl:analyze-string select="$vChunk" 
                 regex="^((\W*\w*)*?)(\W+\w*)$">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(1)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>

When this transformation is applied to the same XML document, the correct first line is produced:

Dec. 13 — As always for a presidential inaugural, security and

Finally, the complete XSLT 2.0 transformation using RegEx:

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()" name="reformat">
    <xsl:param name="pText" select="translate(., '&#xA;', ' ')"/>
    <xsl:param name="pMaxLength" select="64"/>
    <xsl:param name="pTotalLength" select="string-length(.)"/>
    <xsl:param name="pLengthFormatted" select="0"/>

    <xsl:if test="not($pLengthFormatted >= $pTotalLength)">
        <xsl:variable name="vNextLine" 
         select="f:getLine(substring($pText, $pLengthFormatted+1), $pMaxLength)"/>
        <xsl:sequence select="concat($vNextLine, '&#xA;')"/>

        <xsl:call-template name="reformat">
          <xsl:with-param name="pText" select="$pText"/>
          <xsl:with-param name="pMaxLength" select="$pMaxLength"/>
          <xsl:with-param name="pTotalLength" select="$pTotalLength"/>
          <xsl:with-param name="pLengthFormatted" 
                    select="$pLengthFormatted + string-length($vNextLine)"/>
        </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
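      <xsl:otherwise>
            <!-- Cut the chunk back to the last word boundary -->
            <xsl:variable name="vLine" as="xs:string?">
              <xsl:analyze-string select="$vChunk" 
                   regex="^((\W*\w*)*?)(\W+\w*)$">
                <xsl:matching-substring>
                  <xsl:sequence select="regex-group(1)"/>
                </xsl:matching-substring>
              </xsl:analyze-string>
            </xsl:variable>
            <!-- Fall back to the whole chunk when no boundary is found (e.g. a word
                 longer than the line), so the recursive "reformat" template
                 always makes progress instead of looping on an empty line -->
            <xsl:sequence select="if (string-length($vLine) > 0) then $vLine else $vChunk"/>
      </xsl:otherwise>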
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>
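To make the control flow of the regex-based transformation easier to follow, here is a rough Python sketch of the same two pieces: a get_line helper corresponding to f:getLine and a loop corresponding to the recursive "reformat" template. It is an illustration under assumptions, not a port: the pattern \W+\w*\Z is a simplified stand-in that strips the trailing partial word together with the non-word run before it, and the fallback to the whole chunk for over-long words is an addition of this sketch.

```python
import re


def get_line(text, length):
    """Return the longest prefix of `text`, at most `length` characters long,
    that ends at a word boundary."""
    chunk = text[:length]
    # The whole remainder fits, or the chunk already ends right before a
    # non-word character: no trimming needed.
    if len(text) <= length or re.match(r"\W", text[length]):
        return chunk
    # Drop the trailing partial word and the non-word run preceding it.
    trimmed = re.sub(r"\W+\w*\Z", "", chunk)
    # Assumption of this sketch: hard-break when nothing is left, so the
    # caller always advances.
    return trimmed or chunk


def reformat(text, max_len=64):
    """Split `text` into lines by repeatedly taking get_line of the remainder."""
    text = text.replace("\n", " ")   # same as translate(., '&#xA;', ' ')
    lines, pos = [], 0
    while pos < len(text):
        line = get_line(text[pos:], max_len)
        lines.append(line)
        pos += len(line)             # corresponds to $pLengthFormatted
    return lines


if __name__ == "__main__":
    for line in reformat("aaa bbb ccc ddd", max_len=7):
        print(repr(line))
```

As in the stylesheet, every line after the first starts with the separator that preceded it, so a caller may want to strip leading whitespace before printing.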
answered 2015-12-06T06:03:22.860