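I need a script that recursively walks a directory of .xml files, parses each file, and lists the tags from most frequent to least frequent, reporting how many times each tag occurs, so that I can find the most commonly used tags.
I was thinking of Perl, but if you think there is a better way, please let me know.
I was able to find a Perl script that counts words in a document: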
sub by_count {
    $count{$b} <=> $count{$a};
}
open(INPUT, "<[Content_Types].xml");
open(OUTPUT, ">output");
$bucket = "";
while (<INPUT>) {
    @words = split(/\s+/);
    foreach $word (@words) {
        if ($word =~ /($bucket)/io) {
            print OUTPUT "$word\n";
            $count{$1}++;
        }
    }
}
foreach $word (sort by_count keys %count) {
    print OUTPUT "$word occurs $count{$word} times\n";
}
close INPUT;
close OUTPUT;
But I am having trouble defining the $bucket variable. The script is meant to define buckets like this:
$bucket = "monkey | tree | banana"
and the output looks like
word monkey occurs 4 times
word tree occurs 3 times
word banana occurs 1 times
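A note on the bucket string: as far as I can tell, when it is interpolated into the regex the spaces around each `|` become part of the alternatives, so `monkey | tree | banana` looks for "monkey " with a trailing space, which can never match a word that was produced by splitting on whitespace. Below is a minimal sketch of the same counting idea with the spaces removed. The `count_words` helper and the sample text are made up for illustration, and anchoring the match to whole words is my assumption, not something the original script does.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each bucket word occurs in a text.
sub count_words {
    my ($text, @buckets) = @_;
    # Join the words with '|' and no surrounding spaces; quotemeta escapes
    # any regex metacharacters in the words themselves.
    my $alternation = join '|', map { quotemeta } @buckets;
    my $bucket_re   = qr/^($alternation)$/i;   # assumption: whole-word match

    my %count;
    foreach my $word (split /\s+/, $text) {
        $count{lc $1}++ if $word =~ $bucket_re;
    }
    return %count;
}

# Made-up sample text, only for illustration.
my %count = count_words(
    "monkey tree monkey banana tree monkey tree monkey",
    qw(monkey tree banana),
);

# Most frequent first; ties broken alphabetically.
foreach my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "word $word occurs $count{$word} times\n";
}
```

Building the pattern with `join '|'` keeps stray whitespace out of the regex, and `qr//` compiles it once instead of relying on the `/o` flag.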
In my case I have to use a wildcard so that it parses everything between < and >, something like
$bucket = <"<*"."*>">;
But this creates an output file containing all of the XML code, counts every "<" and ">" it encounters, and outputs
occurs 50 times
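One likely reason the wildcard attempt misbehaves: in Perl, angle brackets around anything other than a filehandle or a simple scalar are treated as a filename glob, not a regular expression, and inside a regex `*` is a quantifier rather than a wildcard. A pattern for "everything between < and >" could be sketched as follows. The `extract_tags` helper and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every <...> chunk found in a string.
sub extract_tags {
    my ($text) = @_;
    # '<', then any run of characters that are not '>', then '>'.
    # [^>]* stops at the first '>', so two tags on one line stay separate.
    return $text =~ /(<[^>]*>)/g;
}

# Made-up sample line, only for illustration.
my @tags = extract_tags('<a href="x">text</a><br/>');
print "$_\n" for @tags;
```

This is a plain regex approach and will be confused by a literal `>` inside an attribute value, comments, or CDATA sections; for anything beyond simple files, a real XML parser such as XML::LibXML or XML::Twig is probably the safer choice.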
I need something that does the following:
Example .xml file:
<tag1 This is tag1 />
<tag1 This is tag1 />
<tag2 This is tag2 />
<tag2 This is tag2 />
<tag1 This is tag1 />
<tag2 This is tag2 />
<tag3 This is tag3 />
Output:
<tag1 This is tag1 /> appears 3 times
<tag2 This is tag2 /> appears 3 times
<tag3 This is tag3 /> appears 1 time
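The behaviour described here can be sketched in a few lines of Perl. This is only an illustration under simple assumptions: the example file is inlined as a string, identical `<...>` chunks are counted as the same tag, and `count_tags` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: count identical <...> chunks in a block of text.
sub count_tags {
    my ($text) = @_;
    my %count;
    # In scalar context, /g finds one match per loop iteration.
    $count{$1}++ while $text =~ /(<[^>]*>)/g;
    return \%count;
}

# The example file from the question, inlined here as a string.
my $xml = join "\n",
    '<tag1 This is tag1 />',
    '<tag1 This is tag1 />',
    '<tag2 This is tag2 />',
    '<tag2 This is tag2 />',
    '<tag1 This is tag1 />',
    '<tag2 This is tag2 />',
    '<tag3 This is tag3 />';

my $count = count_tags($xml);

# Most frequent first; ties broken alphabetically so the order is stable.
foreach my $tag (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    my $n = $count->{$tag};
    printf "%s appears %d %s\n", $tag, $n, $n == 1 ? 'time' : 'times';
}
```

Returning a hash reference keeps the counting separate from the reporting, so the same helper could be fed the contents of a real file instead of the inlined string.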
Solved:
#!/usr/bin/perl
use strict;
use warnings;

my %count;

# Sort comparator: highest count first
sub by_count {
    $count{$b} <=> $count{$a};
}

open(my $in,  '<', '/file.xml')  or die "Cannot open /file.xml: $!";    # XML file
open(my $out, '>', 'outputfile') or die "Cannot create outputfile: $!"; # output file

while (my $line = <$in>) {
    # Break the line at every '<', so each piece starts just after a tag opener
    foreach my $piece (split /</, $line) {
        # Capture everything up to the last '>' in the piece
        if ($piece =~ /(.*>)/) {
            $count{$1}++;
        }
    }
}

foreach my $tag (sort by_count keys %count) {
    print $out "<$tag occurs $count{$tag} times\n\n";
}

close $in;
close $out;
Output:
<Default Extension="xlsx" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/> occurs 1 times
<Default Extension="png" ContentType="image/png"/> occurs 1 times
<Override PartName="/word/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> occurs 1 times
Thanks, everyone, for the help; it was really useful. Thanks also to cfrenz for the code he posted on his blog, which I adapted:
http://perlgems.blogspot.pt/2012/05/normal-0-false-false-false-en-us-x-none_2673.html
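The script above reads a single file, while the original question asked for a recursive walk over a directory. A minimal sketch of that part, using the core File::Find module, might look like this. It counts element names rather than whole tags, which is my reading of "most commonly used tags"; the helper names `count_tag_names` and `count_tags_in_dir` are hypothetical, and the regex approach assumes no stray `<` inside comments or CDATA.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: add the element names found in $text to %$count.
sub count_tag_names {
    my ($text, $count) = @_;
    # '<' followed by a name. Closing tags (</x>), comments (<!--) and
    # processing instructions (<?xml) are skipped because they start
    # with '/', '!' or '?' rather than a name character.
    $count->{$1}++ while $text =~ /<([A-Za-z_][\w:.-]*)/g;
    return $count;
}

# Hypothetical helper: walk $dir recursively, counting tags in every .xml file.
sub count_tags_in_dir {
    my ($dir) = @_;
    my %count;
    find(
        {
            no_chdir => 1,    # $_ holds the full path of each entry
            wanted   => sub {
                return unless -f $_ && /\.xml\z/i;
                open(my $in, '<', $_)
                    or do { warn "Cannot open $_: $!"; return };
                my $text = do { local $/; <$in> };    # slurp the whole file
                close $in;
                count_tag_names($text, \%count) if defined $text;
            },
        },
        $dir,
    );
    return \%count;
}

# Directory comes from the command line; defaults to the current directory.
my $dir   = @ARGV ? $ARGV[0] : '.';
my $count = count_tags_in_dir($dir);

# Most frequent first; ties broken alphabetically.
foreach my $tag (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "<$tag> occurs $count->{$tag} times\n";
}
```

Slurping each file instead of reading line by line means a tag that spans several lines is still seen as one tag.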