
I'm extracting data from a TMX file, an XML-based translation memory format. The file looks like this (there are many <tu> entries, one for each translated string):

<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
  <header creationtool="Multilizer" creationtoolversion="6.2.19" datatype="PlainText" segtype="sentence" adminlang="en" srclang="*all*" o-tmf="MLProject">
  </header>
  <body>
    <tu>
      <prop type="context">..\..\BuildProcess\Support_Files\CommonFiles\PSRIP\AlambicEdit.dll.Strings.126.2000</prop>
      <tuv xml:lang="en-CA">
        <seg>Error initializing library: %s.</seg>
      </tuv>
      <tuv xml:lang="en">
        <prop type="status">tsQAed</prop>
        <seg>Error initializing library: %s.</seg>
      </tuv>
      <tuv xml:lang="fr">
        <prop type="status">tsQAed</prop>
        <seg>Erreur lors de l'initialisation de la librairie %s.</seg>
      </tuv>
      <tuv xml:lang="de">
        <prop type="status">tsQAed</prop>
        <seg>Fehler bei der Initialisierung der Bibliothek: %s.</seg>
      </tuv>
      <tuv xml:lang="es">
        <prop type="status">tsQAed</prop>
        <seg>Error inicializando biblioteca: %s.</seg>
      </tuv>
      <tuv xml:lang="it">
        <prop type="status">tsQAed</prop>
        <seg>Errore di inizializzazione libreria: %s.</seg>
      </tuv>
      <tuv xml:lang="ja">
        <prop type="status">tsQAed</prop>
        <seg>ライブラリ初期化時のエラー: %s</seg>
      </tuv>
      <tuv xml:lang="zh-CN">
        <prop type="status">tsQAed</prop>
        <seg>初始化库时出错:%s。</seg>
      </tuv>
      <tuv xml:lang="pt">
        <prop type="status">tsQAed</prop>
        <seg>Erro ao inicializar biblioteca: %s.</seg>
      </tuv>
    </tu>
  </body>
</tmx>

I need to extract specific languages in a specific order, and the TMX does not always respect that order. For instance, the DE and ES entries are sometimes inverted.

Unfortunately, I haven't found a way to get a child element by the value of one of its attributes, so I can't do something like segment = x.getElementsByPropertyValue("xml:lang", "en"), which would be really awesome.

The only alternative I've found was to loop through all of the languages and check them against a properly sorted language array (which would be horribly slow on 600k+ entries in 10 different files).

Is there something obvious I'm missing? Is there such a method?

Note: I'm working in WSH JavaScript, so I have access to any ActiveXObject available in WSH.


2 Answers


If your environment supports querySelector/querySelectorAll, try:

xmldoc.querySelector("tuv[xml\\:lang='es']");

If not, I'm afraid looping is the only way. You could of course consider using a library like jQuery to do the looping for you.
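
If it does come down to a loop, it need not be expensive: one pass over each <tu> can index its <tuv> children by language, after which the segments can be read out in any fixed order. A minimal sketch, assuming the variants have already been pulled out of the DOM into plain objects. The names indexByLang and segmentsInOrder and the { lang, seg } shape are hypothetical, not part of any library:

```javascript
// Build a lang -> segment lookup in a single pass over one <tu>'s variants.
// `variants` is assumed to be an array of { lang: "...", seg: "..." } objects.
function indexByLang(variants) {
  var byLang = {};
  for (var i = 0; i < variants.length; i++) {
    byLang[variants[i].lang] = variants[i].seg;
  }
  return byLang;
}

// Return the segments in the requested order; missing languages become "".
function segmentsInOrder(variants, order) {
  var byLang = indexByLang(variants);
  var result = [];
  for (var i = 0; i < order.length; i++) {
    var lang = order[i];
    var found = Object.prototype.hasOwnProperty.call(byLang, lang);
    result.push(found ? byLang[lang] : "");
  }
  return result;
}

// Example: DE and ES arrive inverted relative to the wanted order.
var sample = [
  { lang: "en", seg: "Error initializing library: %s." },
  { lang: "de", seg: "Fehler bei der Initialisierung der Bibliothek: %s." },
  { lang: "es", seg: "Error inicializando biblioteca: %s." }
];
var ordered = segmentsInOrder(sample, ["en", "es", "de", "fr"]);
```

The cost is one pass per <tu> plus one lookup per wanted language, so the order in which the file lists its languages stops mattering.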

Answered 2013-09-18T19:34:18.107

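Since all I needed was to get things in the right order, I figured that learning a little XSLT wasn't the worst thing in the world, just enough to transform the file into what I need. Thankfully, XSLT can output plain text, which is one of my options, and it is very fast compared to JavaScript. Here's my solution: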

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8" indent="no" omit-xml-declaration="yes" />

<xsl:template match="/">
<xsl:for-each select="tmx/body/tu">
   <xsl:text>[EN]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'en']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[FR]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'fr']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[ES]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'es']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[DE]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'de']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[IT]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'it']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[PT]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'pt']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[JA]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'ja']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[ZH]   </xsl:text><xsl:value-of select="tuv[@xml:lang = 'zh-CN']/seg"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[CAP]  Yes&#13;</xsl:text>
   <xsl:text>[PL]   MyProduct&#13;</xsl:text>
   <xsl:text>[DPT]  &#13;</xsl:text>
   <xsl:text>[REG]  &#13;</xsl:text>
   <xsl:text>[SOU]  Terminology Extraction&#13;</xsl:text>
   <xsl:text>[NOT]  </xsl:text><xsl:value-of select="prop"/><xsl:text>&#13;</xsl:text>
   <xsl:text>[HIS]  EL 2013/09/18&#13;</xsl:text>
   <xsl:text>[~]&#13;&#13;</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

I admit it's neither elegant nor compact, but it works, and since this is a one-time process, that's acceptable.
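
The attribute predicate used in the stylesheet can also be applied from script. In WSH, the MSXML DOM exposes selectSingleNode, which takes an XPath expression, so a <tuv> can be picked by its xml:lang value without a hand-written loop over the children. A rough sketch, assuming MSXML 6.0 is installed and that its XPath engine resolves the reserved xml prefix the same way the stylesheet relies on. The helper name segmentsByXPath and the file name memory.tmx are made up for illustration:

```javascript
// Hypothetical helper: read the <seg> texts of one <tu> node in a fixed order.
// `tu` is assumed to be an MSXML node (anything exposing selectSingleNode).
function segmentsByXPath(tu, order) {
  var result = [];
  for (var i = 0; i < order.length; i++) {
    // Same predicate as in the XSLT: pick the <tuv> by its xml:lang attribute.
    var seg = tu.selectSingleNode("tuv[@xml:lang='" + order[i] + "']/seg");
    // No match gives null, so fall back to an empty string.
    result.push(seg ? seg.text : "");
  }
  return result;
}

// This part only runs under Windows Script Host, where ActiveXObject exists.
if (typeof ActiveXObject !== "undefined") {
  var doc = new ActiveXObject("Msxml2.DOMDocument.6.0");
  doc.async = false; // load synchronously
  if (doc.load("memory.tmx")) { // hypothetical file name
    var tus = doc.selectNodes("/tmx/body/tu");
    for (var j = 0; j < tus.length; j++) {
      var row = segmentsByXPath(tus.item(j), ["en", "fr", "es", "de"]);
      WScript.Echo(row.join("\t"));
    }
  } else {
    WScript.Echo("Load failed: " + doc.parseError.reason);
  }
}
```

Whether this is fast enough for 600k+ entries is untested here; the XSLT route keeps all the work inside the XML engine, which is likely why it performs well.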

Answered 2013-09-18T22:04:45.387