xml - Extract blocks from an XML file using data from a source file

Question

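I have revised this question, since I have now done some reading on XML.

I have a source file containing a list of AuthNumbers: 111222 111333 111444 etc.

I need to search for the numbers in that list and find them in a corresponding XML file. In the XML file, the line is formatted like this: <trpcAuthCode>111222</trpcAuthCode>

This is easy to do with grep, but I need the entire block containing the transaction.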

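The block starts with <trans type="network sale" recalled="false"> or <trans type="network sale" recalled="false" rollback="true"> and/or some other variations. Actually, <trans*> would be best, if such a thing is possible.

The block ends with </trans>

It does not need to be elegant or efficient. I just need it to work. I suspect some transactions are dropping out, and I need a quick way to review the ones that are not being processed.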

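If it helps, here is a link to the original (sanitized) XML: https://www.dropbox.com/s/cftn23tnz8uc9t8/main.xml?dl=0

What I would like to extract: https://www.dropbox.com/s/b2bl053nom4brkk/transaction_results.xml?dl=0

The size of each result will differ, since a transaction can vary greatly in length depending on the number of products purchased. In the results XML, you will see that I extracted the needed XML based on the trpcAuthCode list 111222, 111333, 111444.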

score 0 · Accepted Answer

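On questions about XML and awk, you will often find comments from the gurus (those with a "k" in their reputation) saying that XML processing in awk is complicated or insufficient. As I understand the question, the script is needed for private and/or debugging purposes. For that, my solution should be sufficient, but keep in mind that it will not work on every legal XML file.

Based on your description, a sketch of the script is:

If <trans*> matches, start recording.
If <trpcAuthCode> is found, get its content and compare it with the list. On a match, remember that the block has to be output.
If </trans> matches, stop recording. If output has been enabled, print the recorded block; otherwise discard it.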

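Because I did something similar in SO: Shell scripting - split xml into multiple files, this should not be too hard to implement.

One additional feature is needed, though: feeding the array of AuthNumbers into the script. By a surprising coincidence, I learned how to do that just this morning in SO: How to access an array in an awk, which is declared in a different awk in shell? (thanks to jas for the comment).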

So, putting it all together in a script filter-xml-trpcAuthCode.awk:

BEGIN {
  record = 0 # state for recording
  buffer = "" # buffer for recording
  found = 0 # state for found auth code
  # build temp. array from authCodes which has to be pre-defined
  split(authCodes, list, "\n")
  # build final array where values become keys
  for (i in list) authCodeList[list[i]]
  # for debugging: output of authCodeList
  print "<!-- authCodeList:"
  for (authCode in authCodeList) {
    print authCode
  }
  print "-->"
}

/<trans( [^>]*)?>/ {
  record = 1 # start recording
  buffer = "" # clear buffer
  found = 0 # reset state for found auth code
}

record {
  buffer = buffer"\n"$0 # record line (if recording is enabled)
}

record && /<trpcAuthCode>/ {
  # extract auth code (note: gensub() is a GNU awk extension)
  authCode = gensub(/^.*>([^<]*)<\/trpcAuthCode.*$/, "\\1", "g")
  # check whether auth code in authCodeList
  found = authCode in authCodeList
}

/<\/trans>/ {
  record = 0 # stop recording
  # print buffer if auth code has been found
  if (found) {
    print buffer
  }
}
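
The opening-tag pattern used in the script can be tried in isolation before running it on real data. This is only a small sketch: the helper name is_trans_open is made up for illustration, and the test lines are modelled on the sample XML.

```shell
#!/bin/sh
# is_trans_open LINE (hypothetical helper):
# prints "match" if LINE contains an opening <trans ...> tag, else "no match".
is_trans_open() {
  printf '%s\n' "$1" | awk '{
    # same pattern as in the script: "<trans", optional attributes, then ">"
    if ($0 ~ /<trans( [^>]*)?>/) print "match"; else print "no match"
  }'
}

is_trans_open '<trans type="network sale" recalled="false" rollback="true">'
is_trans_open '<trans>'
is_trans_open '<transSet periodID="1">'
is_trans_open '</trans>'
```

The first two calls should report a match and the last two should not, because the pattern requires either a space or the closing > directly after <trans.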

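Notes:

Initially, I struggled with applying split() to authCodes in BEGIN. This creates an array in which the split values are stored under enumerated keys. So I looked for a way to make the values themselves the keys of the array. (Otherwise, the in operator cannot be used for the search.) I found an elegant solution in the accepted answer to SO: Check if array contains value.
I implemented the proposed pattern <trans*> as /<trans( [^>]*)?>/, which even matches <trans> (although <trans> without attributes does not seem to occur) but not <transSet>.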
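The line
buffer = buffer"\n"$0
appends the current line to the previous content. $0 contains the line without its newline character, so the newline has to be re-inserted. The way I did it, the buffer starts with a newline, and the last line does not end with one. Considering that print buffer adds a newline at the end of the text, this is fine with me. Alternatively, the snippet above could be replaced by
buffer = buffer $0 "\n"
or even
buffer = (buffer != "" ? buffer"\n" : "") $0
(It is a matter of taste.)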
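The filtered file is simply printed to the standard output channel. It can be redirected to a file. With this in mind, I formatted the additional/debugging output as XML comments.
If you are a little familiar with awk, you may notice that there is no next statement anywhere in my script. This is intentional. In other words, the order of the rules is carefully chosen so that a line can be processed/affected by all rules consecutively. (I tested a corner case:
<trans><trpcAuthCode>111222</trpcAuthCode></trans>
Even this is handled correctly.)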
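
The first note (turning the split values into array keys so that the in operator can be used) can be illustrated on its own. The following is a minimal sketch, not part of the script above: the helper name contains and the array name seen are made up for illustration, and it assumes an awk that accepts a multi-line value for -v, as the wrapper script further below does as well.

```shell
#!/bin/sh
# Hypothetical input: three auth codes, one per line (as in the auth code list).
codes=$(printf '%s\n' 111222 111333 111444)

# contains CODE (hypothetical helper):
# prints "yes" if CODE is one of the auth codes, "no" otherwise.
contains() {
  awk -v authCodes="$codes" -v probe="$1" 'BEGIN {
    n = split(authCodes, list, "\n")            # list[1..n] holds the values
    for (i = 1; i <= n; i++) seen[list[i]] = 1  # the values become keys
    # "in" tests for keys only, which is why the step above is needed
    if (probe in seen) print "yes"; else print "no"
  }'
}

contains 111333   # a code from the list
contains 999999   # a code that is not in the list
```

This should print yes for the first call and no for the second. Testing probe in list instead would compare against the enumerated keys 1, 2, 3, not against the auth codes.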

To simplify testing, I added a wrapper bash script filter-xml-trpcAuthCode.sh:

#!/usr/bin/bash
# uncomment next line for debugging
#set -x
# check command line arguments
if [[ $# -ne 2 ]]; then
  echo "ERROR: Illegal number of command line arguments!"
  echo ""
  echo "Usage:"
  echo "$(basename "$0") XML_FILE AUTH_CODES"
  exit 1
fi
# call awk script, passing the auth codes as one newline-separated string
awk -v authCodes="$(cat "$2")" -f filter-xml-trpcAuthCode.awk "$1"

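I tested the script (using bash in cygwin on Windows 10) against your sample file main.xml and got four matching blocks. I was a little worried about the output, because your sample output transaction_results.xml contains only three matching blocks. But on visual inspection, my output seems to be appropriate. (All four matches contain a matching <trpcAuthCode> element.)

For demonstration, I reduced your sample input to sample.xml: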
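
Since the number of blocks differed from the expected result, a quick tally of the auth codes in the extracted output may help to see which code accounts for how many blocks. The following is only a sketch: the helper name count_auth_codes and the miniature input are made up for illustration, and it assumes that each <trpcAuthCode> element sits on a single line, as in the sample file.

```shell
#!/bin/sh
# count_auth_codes (hypothetical helper):
# reads extracted XML on stdin and prints one "COUNT CODE" line per auth code.
count_auth_codes() {
  # keep only the element content, then count identical codes
  sed -n 's/.*<trpcAuthCode>\([^<]*\)<\/trpcAuthCode>.*/\1/p' |
    sort | uniq -c | awk '{ print $1, $2 }'
}

# Made-up miniature input standing in for the real extracted output.
count_auth_codes <<'EOF'
<trans type="network sale" recalled="false">
  <trpcAuthCode>111222</trpcAuthCode>
</trans>
<trans type="network sale" recalled="false" rollback="true">
  <trpcAuthCode>111222</trpcAuthCode>
</trans>
<trans type="network sale" recalled="false">
  <trpcAuthCode>111444</trpcAuthCode>
</trans>
EOF
```

For the made-up input this prints 2 111222 and 1 111444. With real data, the here-document would be replaced by a redirection from the file that holds the script's output.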

<?xml version="1.0"?>
<transSet periodID="1" periodname="Shift" longId="2017-04-27" shortId="052" site="12345">
  <trans type="periodClose">
    <trHeader>
    </trHeader>
  </trans>
  <printCashier>
    <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
  </printCashier>
  <trans type="printCashier">
    <trHeader>
      <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
      <posNum>101</posNum>
    </trHeader>
  </trans>
  <trans type="journal">
    <trHeader>
    </trHeader>
  </trans>
  <trans type="network sale" recalled="false">
    <trHeader>
      <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
    </trHeader>
    <trPaylines>
      <trPayline type="sale" sysid="1" locale="DOLLAR">
        <trpCardInfo>
          <trpcAccount>1234567890123456</trpcAccount>
          <trpcAuthCode>532524</trpcAuthCode>
       </trpCardInfo>
      </trPayline>
    </trPaylines>
  </trans>
  <trans type="network sale" recalled="false">
    <trHeader>
      <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
    </trHeader>
    <trPaylines>
      <trPayline type="sale" sysid="1" locale="DOLLAR">
        <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
        <trpAmt>61.77</trpAmt>
        <trpCardInfo>
          <trpcAccount>2345678901234567</trpcAccount>
          <trpcAuthCode>111222</trpcAuthCode>
        </trpCardInfo>
      </trPayline>
    </trPaylines>
  </trans>
  <trans type="periodClose">
    <trHeader>
      <date>2017-04-27T23:50:17-04:00</date>
    </trHeader>
  </trans>
  <endTotals>
    <insideSales>445938.63</insideSales>
  </endTotals>
</transSet>

For the other sample input, I simply copied the text into a file authCodes.txt:

111222
111333
111444

Using both input files in a sample session:

$ ./filter-xml-trpcAuthCode.sh
ERROR: Illegal number of command line arguments!

Usage:
filter-xml-trpcAuthCode.sh XML_FILE AUTH_CODES

$ ./filter-xml-trpcAuthCode.sh sample.xml authCodes.txt
<!-- authCodeList:
111222
111333
111444
-->

  <trans type="network sale" recalled="false">
    <trHeader>
      <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
    </trHeader>
    <trPaylines>
      <trPayline type="sale" sysid="1" locale="DOLLAR">
        <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
        <trpAmt>61.77</trpAmt>
        <trpCardInfo>
          <trpcAccount>2345678901234567</trpcAccount>
          <trpcAuthCode>111222</trpcAuthCode>
        </trpCardInfo>
      </trPayline>
    </trPaylines>
  </trans>

$ ./filter-xml-trpcAuthCode.sh main.xml authCodes.txt >output.txt

$

The last command redirects the output to a file output.txt, which can be inspected or processed afterwards.
