xml - DOS Batch : dealing with double quotes from XML files

Question

I have written the code below to read XML files (file_1.xml and file_2.xml), extract the string between the tags, and write it to a TXT file. The issue is that some strings include double quotation marks, and the program then treats these characters as batch syntax rather than as part of the strings...

Content of file_1.xml :

<AAA>C086002-T1111</AAA>
<AAA>C086002-T1222 </AAA>
<AAA>C086002-TR333 "</AAA>
<AAA>C086002-T5444  </AAA>

Content of file_2.xml :

<AAA>C086002-T5555 </AAA>
<AAA>C086002-T1666</AAA>
<AAA>C086002-T1777 "</AAA>
<AAA>C086002-T1888          "</AAA>

My code :

@echo off

setlocal enabledelayedexpansion

for /f "delims=;" %%f in ('dir /b D:\depart\*.xml') do (

    for /f "usebackq delims=;" %%z in ("D:\depart\%%f") do (

        (for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "%%z" ^| Findstr /r "<AAA>"') do (

            set code=%%a
            set code=!code:""=!
            set code=!code: =!
            echo !code!

        )) >> result.txt
    )
)

I get this in result.txt :

C086002-T1111
C086002-T1222
C086002-T5444
C086002-T5555
C086002-T1666

In fact, 3 out of the 8 lines are missing. These lines include double quotation marks or follow lines that include double quotation marks...

How can I deal with these characters so that they are treated as part of the strings?

score 2 · Accepted Answer

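Note: parsing XML with batch is risky business, because XML generally ignores whitespace. Simply reformatting the XML into another equivalent, valid layout could break any script you write. That being said...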

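I haven't traced the problem far enough to fully explain the behavior you observe, but the unbalanced quotes are causing a problem in this line: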

(for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "%%z" ^| Findstr /r "<AAA>"') do (

You can eliminate the problem and get your code working by stripping any quotes up front.

@echo off

setlocal enabledelayedexpansion
del result.txt
for /f "delims=;" %%f in ('dir /b D:\depart\*.xml') do (
  for /f "usebackq delims=;" %%z in ("D:\depart\%%f") do (
    set code=%%z
    set code=!code:"=!
    set code=!code: =!
    (for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "!code!" ^| Findstr /r "<AAA>"') do (
      echo %%a
    )) >> result.txt
  )
)

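But you have a potentially major problem. DELIMS does not specify a string; it specifies a list of characters. So your DELIMS=<AAA></AAA> is equivalent to DELIMS=<>/A. If an element value ever contains an A or a /, your code will fail.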
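To see the character-list behavior in isolation, here is a minimal sketch. The value C086-ABC is made up for illustration (it is not from your files); I picked it only because it contains the letter A.

```bat
@echo off
rem Hypothetical element value C086-ABC, chosen because it contains the letter A.

rem Delimiters are the angle brackets, the slash and the letter A,
rem so the first token stops at the A inside the value.
for /f "tokens=1 delims=<AAA></AAA>" %%a in ("<AAA>C086-ABC</AAA>") do echo %%a

rem Delimiters are only the two angle brackets,
rem so token 1 is the tag name and token 2 is the whole value.
for /f "tokens=2 delims=<>" %%a in ("<AAA>C086-ABC</AAA>") do echo %%a
```

The first loop should print only C086-, while the second should print the full C086-ABC. Your sample values happen to contain no A or /, which is why the problem has not shown up yet.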

There is a much better way:

First, you can use FINDSTR to collect all the <AAA>----</AAA> lines from all the files at once, without any loop:

findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"

Each matching line is output as the file path, followed by a colon, followed by the matching line, like so:

D:\depart\file_1.xml:<AAA>C086002-T1111</AAA>

A file path can never contain < or >, so you can use the following to iterate the results, capturing the appropriate token:

for /f "delims=<> tokens=3" %%A in ( ...
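As a quick check of why the third token is the one you want, this small sketch feeds FOR /F a single literal line in the shape FINDSTR prints (the same sample line shown above) and echoes the first three tokens:

```bat
@echo off
rem One line in the shape FINDSTR prints: file path, colon, then the matching line.
for /f "tokens=1-3 delims=<>" %%A in ("D:\depart\file_1.xml:<AAA>C086002-T1111</AAA>") do (
  rem Token 1 is the path with its trailing colon.
  echo token 1 is %%A
  rem Token 2 is the tag name.
  echo token 2 is %%B
  rem Token 3 is the element value.
  echo token 3 is %%C
)
```

Splitting only on < and > leaves the path (with its trailing colon) in token 1, the tag name AAA in token 2, and the value in token 3.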

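Finally, you can put parentheses around the entire loop and redirect just once. I'm assuming you want each run to create a new file, so I use > instead of >>.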

@echo off
setlocal enabledelayedexpansion
>result.txt (
  for /f "delims=<> tokens=3" %%A in (
    'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
  ) do (
    set code=%%A
    set code=!code:"=!
    set code=!code: =!
    echo(!code!
  )
)

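If you only need to trim leading or trailing spaces and quotes, then the solution is even simpler. Specifying a quote as a DELIM character does require odd syntax. Note that there are two spaces between the last ^ and %%B. The first, escaped space is taken as a DELIM character. The unescaped space terminates the FOR /F options string.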

@echo off
>result.txt (
  for /f "delims=<> tokens=3" %%A in (
    'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
  ) do for /f delims^=^"^  %%B in ("%%A") do echo(%%B
)

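Update in response to comment

I'm assuming your data values never contain a colon.

If you want to append the source file name to each line of output, then simply change the first FOR /F to capture the first token (the source file) as well as the third token (the data value). The file token will include the full path plus the trailing colon. The second FOR /F appends the file to the source data string, using the ~nx modifier to get just the name and extension (no drive or path), and adds a colon to the DELIMS option so that the trailing colon is removed.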

@echo off
>result.txt (
  for /f "delims=<> tokens=1,3" %%A in (
    'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
  ) do for /f delims^=:^"^  %%C in ("%%B;%%~nxA") do echo %%C
)

score 0 · Accepted Answer

If I take @dbenham's suggestion and extend it to echo the file name:

@echo off
>result.txt (
    for /f %%f in ("D:\depart\*.xml") do (
        for /f "delims=<> tokens=3" %%A in ('findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"') do (
             for /f delims^=^"^  %%B in ("%%A") do (
               echo %%B;%%f
             )
         )
     )
 )

Thanks for your feedback on this code!
