xml - Script to parse .xml files and list tags

Question

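I need a script that recursively walks a directory of .xml files, parses each file, and lists the tags from most frequent to least frequent, reporting how many times each tag occurs, so that I can find the most commonly used tags.

I was thinking of Perl, but if you think there is a better way, please let me know.

I was able to find a Perl script that counts the words in a document: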

sub by_count {
   $count{$b} <=> $count{$a};
}

open(INPUT, "<[Content_Types].xml");
open(OUTPUT, ">output");
$bucket = "";

while(<INPUT>){
   @words = split(/\s+/);
   foreach $word (@words){
            if($word=~/($bucket)/io){

      print OUTPUT "$word\n";
      $count{$1}++;}

   }
}
foreach $word (sort by_count keys %count) {

   print OUTPUT "$word occurs $count{$word} times\n";

}

close INPUT;
close OUTPUT;

But I am having trouble defining the $bucket variable. The script is meant to define buckets like this:

$bucket = "monkey | tree | banana"

and produce output like:

word monkey occurs 4 times
word monkey occurs 3 times
word monkey occurs 1 times

In my case I have to use a wildcard so that it picks up everything between < and >, like:

$bucket = <"<*"."*>">;

But this creates an output file containing all of the XML code, counts every "<" and ">" it encounters, and outputs:

occurs 50 times

I need something that does the following.

Example .xml file:

<tag1 This is tag1 />
<tag1 This is tag1 />
<tag2 This is tag2 />
<tag2 This is tag2 />
<tag1 This is tag1 />
<tag2 This is tag2 />
<tag3 This is tag3 />

Output:

<tag1 This is tag1 /> appears 3 times
<tag2 This is tag2 /> appears 3 times
<tag3 This is tag3 /> appears 1 time

Solved:

#!/usr/bin/perl

sub by_count {
   $count{$b} <=> $count{$a}; 
}

open(INPUT, "</file.xml"); #xml file
open(OUTPUT, ">outputfile"); #Create an output file
$bucket = qw/./;


while(<INPUT>){
   @words = split(/\</); #Whenever reaches a '<' breaks the string

   foreach $word (@words){
            if($word=~/($bucket*>)/io){

      #print OUTPUT "$word";
      #print OUTPUT "\n\n";
      $count{$1}++;}

   }
}
foreach $word (sort by_count keys %count) {

   print OUTPUT "<$word occurs $count{$word} times\n\n";

}

close INPUT;
close OUTPUT;

Output:

<Default Extension="xlsx" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/> occurs 1 times

<Default Extension="png" ContentType="image/png"/> occurs 1 times

<Override PartName="/word/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> occurs 1 times

Thanks, everyone, for your help; it was really useful. Thanks also to cfrenz for the code posted on his blog, which I edited:

http://perlgems.blogspot.pt/2012/05/normal-0-false-false-false-en-us-x-none_2673.html
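For comparison, the same "match everything from `<` to `>`" idea can be sketched in Python. This is only a rough illustration and not the asker's code: the names `TAG_RE` and `count_raw_tags` are made up for this example. Like the Perl version, it treats the file as plain text, so it will miscount if a `>` appears inside an attribute value, a comment, or a CDATA section.

```python
import re
from collections import Counter

# Hypothetical pattern: a '<' that does not start a closing tag, comment,
# or processing instruction, followed by everything up to the next '>'.
TAG_RE = re.compile(r"<(?![/!?])[^>]*>")


def count_raw_tags(text):
    """Count the literal text of every opening or self-closing tag."""
    return Counter(TAG_RE.findall(text))


def format_report(counts):
    """Return report lines ordered from most to least frequent."""
    return [f"{tag} occurs {n} times" for tag, n in counts.most_common()]
```

Passing the contents of a file to `count_raw_tags` and printing the lines from `format_report` should give a listing similar in spirit to the output shown above.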

score 3 · Accepted Answer

Just to give an example in XQuery, a language for querying XML files:

for $element in //*
let $name := $element/local-name()
group by $name
order by count($element) descending
return concat($name, ": ", count($element))

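How to apply this to multiple XML documents depends on the query processor you use. Depending on your needs, you can do it within XQuery, or use find or some other method to invoke the script once per file.

To run it you need an XQuery processor. For this example I will recommend the open-source BaseX; any other XQuery engine works as well. Make sure to install it so that you also get the command-line wrappers, either by downloading and installing it or by using the "basex" package in Debian and Ubuntu.

Store the script above in a file, here test.xq, and use find to invoke it for each XML file in the current folder: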

find . -name "*.xml" -exec basex -i {} test.xq \;

It will print the statistics for each file.
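For readers without an XQuery processor, roughly the same count can be sketched with Python's standard library. This is a minimal sketch under assumptions, not part of the answer above: the function names `local_name`, `count_tags` and `print_report` are hypothetical, files are selected by a `.xml` suffix, and files that are not well-formed are skipped with a warning. Unlike the find invocation, it sums the counts over all files instead of reporting per file.

```python
import os
import sys
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix, similar to XQuery's local-name()."""
    return tag.rsplit("}", 1)[-1]


def count_tags(root_dir):
    """Count element names in every .xml file below root_dir."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if not filename.lower().endswith(".xml"):
                continue
            path = os.path.join(dirpath, filename)
            try:
                tree = ET.parse(path)
            except ET.ParseError as err:
                # Assumption: skip malformed files rather than abort the run.
                print(f"skipping {path}: {err}", file=sys.stderr)
                continue
            for element in tree.iter():
                if isinstance(element.tag, str):  # ignore non-element nodes
                    counts[local_name(element.tag)] += 1
    return counts


def print_report(root_dir="."):
    """Print 'name: count' lines, most frequent first."""
    for name, count in count_tags(root_dir).most_common():
        print(f"{name}: {count}")
```

Calling `print_report(".")` from the top of the directory tree prints one line per element name, in the same `name: count` shape as the XQuery above.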

score 2 · Accepted Answer

One-liner using xml2:

find . -type f -name '*.xml' -print0 | \
    xargs -0 -n 1 sh -c 'xml2 < "$0"' | \
    grep -v '/@' | cut -d=  -f 1 | uniq | grep -o '[^/]\+$' | \
    sort | uniq -c | sort -rn

Example output:

  48376 id
  16125 username
  16125 title
  16125 timestamp
  16125 sha1
  16125 ns
  16106 text
  14711 page
  10436 comment
   8032 minor
   4978 data
   4977 track
   4977 timecode
   4455 BlockGroup
   2262 ReferenceBlock
   1414 sitename
   1414 namespace
   1414 generator
   1414 case
   1414 base
    385 SimpleBlock
    142 discardable
    137 Timecode
    130 Cluster
    126 keyframe
     40 !
     38 name
     28 TrackType
...

Update:

Variant that "extracts everything between < and >", yet still using xml2 to handle XML correctly:

find . -type f -name '*.xml' -print0 | xargs -0 -n 1 sh -c 'xml2 < "$0"' | sed 's!^\([^@=]*\)=.*!\1=!'  | 2xml | sed 's!>!>\n!g' | grep -v '^</' | sed 's!^<!!; s!/\?>!!;' | sort | uniq -c | sort -rn

Example output:

   4986 id
   1662 username
   1662 title
   1662 timestamp
   1662 sha1
   1662 revision
   1662 page
   1662 ns
   1662 contributor
   1303 comment
    631 minor
    170 text xml:space="preserve" bytes="72"
     84 sitename
     84 siteinfo
     84 namespaces
     84 namespace key="9" case="first-letter"
     84 namespace key="8" case="first-letter"
     84 namespace key="7" case="first-letter"
     84 namespace key="6" case="first-letter"
     84 namespace key="5" case="first-letter"
...

Update 2: Another attempt to understand what you want.

My input sample:

<q>
    <w tag="11"/>
    <w tag="22"/>
    <r/>
    <r/>
    <w tag="22"/>
    <w/>
    <w/>
    <w>ignore me
    </w>
    <r   />
    <ololo>
        <r />
        <!--
        <w tag="33"/>
        -->
    </ololo>
</q>

Script:

cat q.xml | xml2  | sed 's!^\([^@=]*\)=.*!\1=!' | grep -v '/!=' | 2xml | xmllint -format - | sed 's/^\s*//g' | grep -v '^</\|^$' | sed 's!/\?>$!/>!' | sort | uniq -c | sort -rn

Output:

  4 <r/>
  3 <w/>
  2 <w tag="22"/>
  1 <?xml version="1.0"?/>
  1 <w tag="11"/>
  1 <q/>
  1 <ololo/>

Is it something like what you want?
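The same normalization (drop text and comments, keep attributes, count each element as an empty tag) can also be sketched with a real XML parser in Python. This is an illustration under assumptions rather than a port of the pipeline above: `element_signature` and `count_signatures` are hypothetical names, attributes are sorted by name so that attribute order does not split the counts, and namespaced elements would show up in ElementTree's `{uri}name` form.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from xml.sax.saxutils import quoteattr


def element_signature(element):
    """Render an element as an empty tag with its attributes, e.g. <w tag="22"/>."""
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in sorted(element.attrib.items())
    )
    return f"<{element.tag}{attrs}/>"


def count_signatures(xml_text):
    """Count element signatures in a well-formed XML string.

    Text content is ignored, and the default parser drops comments,
    so commented-out elements are not counted.
    """
    root = ET.fromstring(xml_text)
    return Counter(element_signature(element) for element in root.iter())
```

Run on the q.xml sample above, `count_signatures(...).most_common()` should list the same element counts as the shell pipeline, without the line for the XML declaration.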

score 0 · Accepted Answer

For the input you provided (which is not valid XML):

<tag1 This is tag1 />
<tag1 This is tag1 />
<tag2 This is tag2 />
<tag2 This is tag2 />
<tag1 This is tag1 />
<tag2 This is tag2 />
<tag3 This is tag3 />

you can use basic Unix tools:

$ sort <input.txt |uniq -c

This returns:

3 <tag1 This is tag1 />
3 <tag2 This is tag2 />
1 <tag3 This is tag3 />
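Since the question asks for the tags ordered from most to least frequent, here is a small Python sketch of the same line-counting idea with that ordering applied. It assumes, as the shell command does, that every tag sits on its own line; the name `count_lines` is made up for this example.

```python
from collections import Counter


def count_lines(text):
    """Count identical non-empty lines, like `sort | uniq -c`."""
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def report(text):
    """Return 'count line' strings, most frequent first."""
    return [f"{n} {line}" for line, n in count_lines(text).most_common()]
```

`Counter.most_common()` does the job of a trailing `sort -rn`, so the most frequent lines come first.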
