
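This is only the second Perl script I have written, so any constructive help or advice would be much appreciated. Also, please note that I am working on a Windows machine with Strawberry Perl. I know that a Tidy module exists for Perl, but (for reasons not worth explaining in this note) I would rather call tidy.exe from the script than use the module.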

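What I want my Perl script to do:

  1. Take an HTML file, copy it, and give the copy an .xml extension.

  2. Run tidy.exe on the newly created .xml file to make it well-formed XML.

  3. Strip the XHTML namespace from the newly created well-formed .xml file.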

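When I run it from the command line with G:\TestFolder>perl tidy_cleanup.pl, it produces the desired result. However, when I launch the script from an icon, it skips step 2 listed above. Based on the code posted below, do you know why it behaves this way?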

Here is my code:

#!/usr/bin/perl

use strict;
use warnings;

use File::Basename;
use FileHandle;

my $basename;
my @files = glob("*.html");

foreach my $file (@files) {

  my $oldext   = ".html";
  my $newext   = ".xml";
  my $newerext = "v2.xml";

  my $newfile  = $file;
  $newfile     =~ s/$oldext/$newext/;

  my $newerfile = $newfile;
  $newerfile    =~ s/$newext/$newerext/;

  open IN, $file or die "Can't read source file $file: $\n";
  open OUT, ">$newfile" or die "Can't write on file $newfile: $!\n";

  print "Copying $file to $newfile\n";


{while(<IN>)

{  
print OUT $_;  

close(IN);
close(OUT);


}

my $xmltidy = "for \%i in ($newfile) do c:\\Tidy\\tidy.exe --output-xml yes --numeric-entities yes --doctype omit --quote-nbsp no -asxml -utf8 -numeric -m \"\%i\"";
system($xmltidy);


print "\nfinished running tidy \n\n";
}

  {
    open NEWIN,  "$newfile"    or die "Can't read source file $newfile: $!\n";
    open NEWOUT, ">$newerfile" or die "Can't write on file $newerfile: $!\n";

    print "Copying $newfile to $newerfile\n";
    {
      while (<NEWIN>) {
        if ( /(\<html)( xmlns="http:\/\/www.w3.org\/1999\/xhtml" xml:lang="en-GB")(.*)/ ) {
          print NEWOUT "<html$3";
        }
        else {
          print NEWOUT $_;
        }
      }

      close(NEWIN);
      close(NEWOUT);
    }
  }
}

2 Answers


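The reason your program doesn't work from the shortcut is probably that it is looking for the HTML files in the wrong directory. When you run perl tidy_cleanup.pl from the command line, it looks in your current working directory, but when you set up a shortcut you need to specify the working directory in the field labelled Start in:.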
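If you would rather not depend on the shortcut's settings, one option is to have the script switch to its own folder before it globs. This is only a minimal sketch, and it assumes the HTML files sit in the same folder as tidy_cleanup.pl (as the G:\TestFolder prompt suggests); FindBin is a core module that exposes the script's directory as $Bin.

```perl
use strict;
use warnings;
use FindBin qw($Bin);

# Assumption: the HTML files live next to the script itself.
# Make that folder the working directory so glob('*.html') sees the
# same files however the script was started.
chdir $Bin or die "Can't change directory to $Bin: $!\n";

my @files = glob '*.html';
print "Found ", scalar @files, " HTML file(s) in $Bin\n";
```

With this in place the Start in: field no longer matters for finding the files, although setting it is still the simpler fix.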

However, as I said in my comment, when you copy from HTML to XML you process only one line of the file, because you close the file handles inside the while loop.
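To make that concrete, here is a rough sketch of a line-by-line copy where the handles are closed only after the loop has finished. The helper name copy_lines is made up for illustration and is not part of your script.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $src to $dst one line at a time.
sub copy_lines {
  my ($src, $dst) = @_;

  open my $in,  '<', $src or die "Can't read source file $src: $!\n";
  open my $out, '>', $dst or die "Can't write on file $dst: $!\n";

  while (my $line = <$in>) {
    print $out $line;
  }

  # Close only once every line has been read and written
  close $in;
  close $out or die "Can't close $dst: $!\n";
}
```

You would call it as copy_lines($file, $newfile) inside your foreach loop. In practice File::Copy does the same job in a single call.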

This is how I would write what I think you want.

use strict;
use warnings;
use autodie;

use File::Copy 'copy';

my $tidy = 'C:\Tidy\tidy.exe';
die "'tidy.exe' not found" unless -f $tidy;

for my $html_file (glob '*.html') {

  (my $xml_file = $html_file) =~ s/\.html\z/.xml/;
  copy $html_file, $xml_file;

  print qq{Tidying "$xml_file"\n};

  qx{"$tidy" --output-xml yes --numeric-entities yes --doctype omit --quote-nbsp no -asxml -utf8 -numeric -m "$xml_file"};

  print "Finished running tidy\n\n";

  (my $v2_file = $xml_file) =~ s/\.xml\z/_v2.xml/;
  open my $xml_fh,  '<', $xml_file;
  open my $v2_fh,   '>', $v2_file;

  print qq{Copying "$xml_file" to "$v2_file"\n};

  while (<$xml_fh>) {
    s/\s*xmlns="[^"]+"//;
    s/\s*xml:lang="[^"]+"//;
    print $v2_fh $_;
  }

  print "Copy complete\n\n";
}
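One thing the code above does not do is check whether tidy actually ran. As a hedged sketch, the list form of system skips the shell and leaves the exit status in $?; the helper name run_command is invented here. HTML Tidy is documented to exit with 0 on success, 1 when there were warnings and 2 when there were errors.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its exit code.
sub run_command {
  my @cmd = @_;

  my $status = system @cmd;
  die "Failed to start '$cmd[0]': $!\n" if $status == -1;

  # The exit code is held in the high byte of the wait status
  return $status >> 8;
}

# Possible use inside the loop above (not executed here):
#   my $exit = run_command($tidy, '-asxml', '-utf8', '-numeric', '-m', $xml_file);
#   warn qq{tidy reported errors in "$xml_file"\n} if $exit >= 2;
```

Passing the arguments as a list also means file names containing spaces need no extra quoting.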
answered 2014-07-21T15:22:50.030
use strict;
use warnings;
use File::Basename;
use FileHandle;

my @files = glob("*.html");
foreach my $file (@files) {

  my $oldext   = ".html";
  my $newext   = ".xml";
  my $newerext = "v2.xml";

  my $newfile = $file;
  $newfile =~ s/\Q$oldext\E\z/$newext/;

  my $newerfile = $newfile;
  $newerfile =~ s/\Q$newext\E\z/$newerext/;

  open IN, "<", $file or die "Can't read source file $file: $!\n";
  open OUT, ">", $newfile or die "Can't write on file $newfile: $!\n";
  print "Copying $file to $newfile\n";

  # Copy every line before closing either handle
  while (<IN>) {
    print OUT $_;
  }
  close(IN);
  close(OUT);

  # Call tidy.exe directly on the copied file
  my $xmltidy = "c:\\Tidy\\tidy.exe --output-xml yes --numeric-entities yes --doctype omit --quote-nbsp no -asxml -utf8 -numeric -m \"$newfile\"";
  system($xmltidy);
  print "\nfinished running tidy \n\n";

  open NEWIN, "<", $newfile or die "Can't read source file $newfile: $!\n";
  open NEWOUT, ">", $newerfile or die "Can't write on file $newerfile: $!\n";
  print "Copying $newfile to $newerfile\n";

  # Drop the XHTML namespace from the <html> start tag; copy other lines as-is
  while (<NEWIN>) {
    if (/(\<html)( xmlns="http:\/\/www.w3.org\/1999\/xhtml" xml:lang="en-GB")(.*)/) {
      print NEWOUT "<html$3\n";
    }
    else {
      print NEWOUT $_;
    }
  }
  close(NEWIN);
  close(NEWOUT);
}
answered 2014-07-22T13:42:03.237