html - Strip HTML from files in a directory using Perl

Question

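A few days ago I asked a question about stripping HTML from files using Perl. I am a n00b, and I have searched the site for answers to my question, but unfortunately I couldn't find anything. That may be because I am a n00b and didn't recognize the answer when I was looking at it.

So, here is the situation. I have a directory containing about 20 GB of text files. I want to strip the HTML from each file and write each result to its own text file. I have written the program below, and it seems to do the trick for the first 12 text files in the directory (there are about 12,000 text files in total), but I have hit a couple of snags. The first is that after the 12th text file has been parsed, I start getting warnings about deep recursion, and shortly afterwards the program quits because I have run out of memory. I imagine my programming is extremely inefficient. So I am wondering whether any of you can see obvious errors in my code that would cause me to run out of memory. Once I figure things out, I hope I will be able to contribute.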

#!/usr/bin/perl -w
#use strict;
use Benchmark;
#get the HTML-Format package from the package manager.
use HTML::Formatter;
#get the HTML-TREE from the package manager
use HTML::TreeBuilder;
use HTML::FormatText;
$startTime = new Benchmark;
my $direct="C:\\Directory";
my $slash='\\';

opendir(DIR1,"$direct")||die "Can't open directory";
my @New1=readdir(DIR1);

foreach $file(@New1)
{

if ($file=~/^\./){next;}
#Initialize the variable names.
my $HTML=0;
my $tree="Empty";
my $data="";
#Open the file and put the file in variable called $data

{
    local $/;
    open (SLURP, "$direct$slash"."$file") or die "can't open $file: $!"; 
    #read the contents into data
    $data = <SLURP>; 

    #close the filehandle called SLURP
    close SLURP or die "cannot close $file: $!";
    if($data=~m/<HTML>/i){$HTML=1;}
    if($HTML==1)
        {
            #the following steps strip out any HTML tags, etc.
            $tree=HTML::TreeBuilder->new->parse($data);
            $formatter=HTML::FormatText->new(leftmargin=> 0, rightmargin=>60);
            $Alldata=$formatter->format($tree); 
        }
}
#print
my $outfile = "out_".$file;
open (FOUT, "> $direct\\$outfile");
print FOUT "file: $file\nHTML: $HTML\n$Alldata\n","*" x 40, "\n" ;
close(FOUT);

}


$endTime = new Benchmark;
$runTime = timediff($endTime, $startTime);
print ("Processing files took ", timestr($runTime));

score 2 · Accepted Answer

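You are holding the entire directory listing in memory at once, in @New1.

Also, if you are using an old version of HTML::TreeBuilder, objects of this class may need to be deleted explicitly, because they used to be immune to Perl's automatic garbage collection.

Here is a program that avoids both problems by reading the directory incrementally and by formatting the text with HTML::FormatText->format_string, which implicitly deletes any HTML::TreeBuilder objects that it creates.

In addition, File::Spec makes a tidier job of building absolute file paths, and it is a core module, so it does not need to be installed on your system.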

use strict;
use warnings;

use File::Spec;
use HTML::FormatText;

my $direct = 'C:\Directory';

opendir my $dh, $direct or die "Can't open directory";

while ( readdir $dh ) {

  next if /^\./;

  my $file = File::Spec->catfile($direct, $_);
  my $outfile = File::Spec->catfile($direct, "out_$_");
  next unless -f $file;

  my $html = do {
    open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
    local $/;
    <$fh>;
  };

  next unless $html =~ /<html/i;

  my $formatted = HTML::FormatText->format_string(
      $html, leftmargin => 0, rightmargin => 60);

  open my $fh, '>', $outfile or die qq(Unable to open "$outfile" for writing: $!);

  print $fh "File: $file\n\n";
  print $fh "$formatted\n";
  print $fh "*" x 40, "\n" ;

  close $fh or die qq(Unable to close "$outfile" after writing: $!);
}
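
As a small illustration of the File::Spec->catfile call used in the program above, the following sketch joins a directory and a file name. The values of $dir and $name are made up for the example, and the separator that appears in the output depends on the platform.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical directory and file names, for illustration only.
my $dir  = 'Directory';
my $name = 'page1.txt';

# catfile joins the parts using the separator of the current platform.
my $in  = File::Spec->catfile($dir, $name);
my $out = File::Spec->catfile($dir, "out_$name");

print "$in\n$out\n";
```

Because the separator is chosen by the module, the same script should not need a hard-coded '\\' or '/'.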

score 1 · Accepted Answer

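What was wrong with the answers to your last question?

You are opening the file for writing without checking the return code. Are you sure it succeeded? In which directory are the files being created?

A better approach would be:

Read the files one at a time
Strip the HTML
Write out the new file in the correct directory, checking the return code

Something like: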

while ( my $file = readdir DIR ) {

    # ... process $file ...

    open my $newfile, '>', "$direct/out_$file" or die "cannot open out_$file: $!\n";

    # ... etc.
}

score 0 · Accepted Answer

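How to reduce the memory footprint of this application:

Does the problem persist when you add $tree = $tree->delete to the end of the loop?

The Perl garbage collector cannot resolve circular references, so you have to destroy the tree manually to avoid running out of memory.

(See the first example in the module documentation at http://metacpan.org/pod/HTML::TreeBuilder)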
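
A minimal sketch of that advice, assuming HTML::TreeBuilder and HTML::FormatText are installed. The helper name html_to_text is hypothetical; it only wraps the parse, format, and delete steps so that the tree is destroyed before the next file is handled.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# html_to_text is a hypothetical helper name, not part of either module.
sub html_to_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;    # tell the parser that the document is complete

    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 60);
    my $text      = $formatter->format($tree);

    # Break the parent/child reference cycles so the memory can be reclaimed.
    $tree = $tree->delete;

    return $text;
}

print html_to_text('<html><body><p>Hello, world</p></body></html>');
```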

You should put the readdir inside the loop. The way you have coded it, you first read in this huge list of files. When you say

my $file;
while (defined($file = readdir DIR1)) {..}

only one entry is actually read at a time. That should save some extra memory.
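
The loop above can be tried on a throwaway directory. This is only a sketch: the temporary directory and the three file names are invented for the example, and a real script would process each file where the counter is incremented.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Build a temporary directory holding three empty files (hypothetical names).
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.txt c.txt)) {
    open my $fh, '>', File::Spec->catfile($dir, $name)
        or die "Cannot create $name: $!";
    close $fh or die "Cannot close $name: $!";
}

opendir my $dh, $dir or die "Cannot open $dir: $!";

my $count = 0;
# In scalar context readdir returns one entry per call,
# so the full listing is never held in an array.
while (defined(my $entry = readdir $dh)) {
    next if $entry =~ /^\./;    # skip ".", ".." and hidden files
    $count++;                   # a real script would process the file here
}
closedir $dh;

print "Visited $count files\n";
```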

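A few other comments on style:

Default values

You give $tree a default value of "Empty". This is completely unnecessary. If you want to show that a variable is undefined, set it to undef, which it is by default. Perl guarantees this initialization.

Backslashes

You use backslashes as directory separators? Don't bother; just use normal forward slashes. Unless you are on DOS, normal slashes work too; Windows isn't that dumb.

Statement modifiers

The line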

if ($file=~/^\./){next;}

can be written more readably as

next if $file =~ /^\./;

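Consistent use of parentheses

Your use of parentheses around function argument lists is inconsistent. You can omit the parentheses for all built-in functions unless there is an ambiguity. I prefer to avoid them; others may find them easier to read. But please stick to one style!

A better regex

You test for /<HTML>/i. What if I told you that the html tag can have attributes? You should consider testing for /<html/i instead.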
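
To make the difference concrete, here is a short sketch that runs both patterns against an opening tag carrying an attribute. The sample string is made up for the example.

```perl
use strict;
use warnings;

# A hypothetical document whose opening tag carries an attribute.
my $data = '<html lang="en"><body>Hi</body></html>';

# Requires ">" directly after the tag name, so the attribute defeats it.
my $exact_match = $data =~ /<HTML>/i ? 1 : 0;

# Only requires the tag to start, so attributes do not matter.
my $prefix_match = $data =~ /<html/i ? 1 : 0;

print "/<HTML>/i matched: $exact_match\n";     # prints 0
print "/<html/i matched:  $prefix_match\n";    # prints 1
```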

Simplification (removes another bug)

Your test

if($data=~m/<HTML>/i){$HTML=1;}
if($HTML==1) {...}

can be written as

$HTML = $data =~ /<html/i;
if ($HTML == 1) {...}

which can be written as

$HTML = $data =~ /<html/i;
if ($HTML) {...}

which can be collapsed to

if ($data =~ /<html/i) {...}

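The way you implemented it, the $HTML variable was never reset to a false value. So once one file contained html, all subsequent files would be treated as html as well. You can counter problems like this by declaring your variables in the innermost sensible scope.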
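
The effect of scope on such a flag can be sketched as follows. The two sample strings and all variable names are invented for the example; the first loop keeps one flag outside the loop and only ever sets it, while the second declares the flag inside the loop.

```perl
use strict;
use warnings;

# Hypothetical file contents: only the first one contains HTML.
my @contents = ('<html><p>one</p></html>', 'plain text');

# Flag declared outside the loop and only ever set: it goes stale.
my $stale_flag = 0;
my @stale;
for my $data (@contents) {
    $stale_flag = 1 if $data =~ /<html/i;
    push @stale, $stale_flag ? 'html' : 'text';
}

# Flag declared inside the loop: every iteration starts fresh.
my @scoped;
for my $data (@contents) {
    my $is_html = $data =~ /<html/i;
    push @scoped, $is_html ? 'html' : 'text';
}

print "outer flag: @stale\n";     # prints "outer flag: html html"
print "inner flag: @scoped\n";    # prints "inner flag: html text"
```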

Use HTML::FormatText, with a hat tip to @pavel

Make full use of the modules you load. Look at what I found in the example for HTML::FormatText:

my $string = HTML::FormatText->format_file(
           'test.html',
           leftmargin => 0, rightmargin => 50
           );

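You can easily adapt that to avoid building the tree manually. Why didn't you try this approach, as @pavel told you in your other post? It would have saved you the memory problems...

Use strict

Why did you comment out use strict? Getting as many fatal warnings as possible is important when learning a language, and also when writing solid code. It will force you to declare all your variables, such as $file, properly. And prefer use warnings to the -w switch, which is a bit outdated.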
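
As a rough illustration of what use strict buys you, the sketch below compiles a snippet containing a misspelled variable name inside a string eval, so that the compile-time error can be captured and printed instead of ending the script. The names $total and $totl are made up for the example.

```perl
use strict;
use warnings;

# Compile a snippet with a misspelled variable name under "use strict".
# The names $total and $totl are hypothetical.
my $ok = eval q{
    use strict;
    my $total = 1;
    $totl = 2;    # typo: this variable was never declared
    1;
};
my $error = $@;

# Under strict the snippet fails to compile, so $ok is undefined.
print $ok ? "compiled\n" : "rejected: $error";
```

Without use strict, the typo would silently create a new global variable and the mistake would go unnoticed.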

Well done

But well done indeed for checking the return value of close ;-) That is quite unusual!