perl - sed / perl regex is extremely slow

Question

Private information in this post has been removed.

score 4 · Accepted Answer

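The problem with bash scripting is that, while it is very flexible and powerful, it creates a new process for nearly everything, and forking is expensive. In each iteration of the loop, you spawn 3× echo, 2× awk, 1× sed and 1× perl. Restricting yourself to one process (and thus one programming language) will boost performance.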

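Then, you re-read output.txt every time you call perl. IO is always slow, so buffering the file in memory would be more efficient, if you have the memory.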

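Multithreading would work if there were no hash collisions, but it is difficult to program. Simply translating the script to Perl will give you a greater performance increase than turning Perl into multithreaded Perl. [citation needed]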

You would probably write something like

#!/usr/bin/perl
use strict; use warnings;
open my $cracked, "<", "cracked.txt" or die "Can't open cracked";
my @data = do {
  open my $output, "<", "output.txt" or die "Can't open output";
  <$output>;
};

while(<$cracked>) {
  chomp; # strip the trailing newline so it does not end up inside $pwd
  my ($hash, $seed, $pwd) = split /:/, $_, 3;
  # transform $hash here like "$hash =~ s/foo/bar/g" if really necessary

  # say which line we are at
  print "at line $. with pwd=$pwd\n";

  # do substitutions in @data
  s/\Q$hash\E/$hash ( $pwd )/ for @data;
  # the \Q...\E makes any characters in between non-special,
  # so they are matched literally.
  # (`C++` would match many `C`s, but `\QC++\E` matches the character sequence)
}

# write @data to the output file

(untested or anything, no guarantees)
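The script above ends with a comment where @data should be written out. A minimal sketch of that step follows, assuming the result may replace the target file; the sub name write_lines and the ".tmp" suffix are hypothetical and not part of the original answer.

```perl
use strict; use warnings;

# Write the buffered lines to $path through a temporary file, then rename
# it into place so an interrupted run does not leave a half-written file.
# ASSUMPTION: "$path.tmp" is free to use as a scratch name.
sub write_lines {
    my ($path, @lines) = @_;
    my $tmp = "$path.tmp";
    open my $out, ">", $tmp or die "Can't open $tmp: $!";
    print {$out} @lines;                  # lines still carry their own "\n"
    close $out or die "Can't close $tmp: $!";
    rename $tmp, $path or die "Can't rename $tmp to $path: $!";
    return scalar @lines;                 # number of lines written
}
```

Calling write_lines("output.txt", @data) after the loop would overwrite the input; pass a different path if the original file should be kept.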

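While this is still an O(n²) solution, it will perform better than the bash script. Note that it can be reduced to O(n) when @data is organized into a hash, indexed by the hash codes: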

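# parse_line() stands for whatever magic extracts the hash code from a line;
# the map returns one key-value pair per line
my %data = map { (parse_line($_) => $_) } @data;
...;
$data{$hash} =~ s/\Q$hash\E/$hash ( $pwd )/ if exists $data{$hash}; # instead of evil for-loop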

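In practice, you would store in the hash a reference to an array holding all lines that contain the hash code, so the previous lines would rather be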

my %data;
for my $line (@data) {
   my $key = parse_line($line);
   push @{ $data{$key} }, $line; # @$data{$key} would be a slice of a hash ref $data
}
...; 
s/\Q$hash\E/$hash ( $pwd )/ for @{$data{$hash}}; # is still faster!
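Putting the pieces together, here is a self-contained sketch of the indexed approach. It rests on an assumption the answer does not state: that each line of the data begins with the hash code followed by a colon. The helpers build_index and annotate are hypothetical names, and the index stores references to the elements of @data (rather than copies) so the original line order survives for writing out.

```perl
use strict; use warnings;

# ASSUMPTION: every data line looks like "<hash>:<rest>".
sub parse_line {
    my ($line) = @_;
    my ($key) = split /:/, $line, 2;
    return $key;
}

# Map each hash code to references to the lines that carry it.
sub build_index {
    my ($lines) = @_;                     # array ref, e.g. \@data
    my %index;
    for my $line (@$lines) {              # $line aliases the array element
        push @{ $index{ parse_line($line) } }, \$line;
    }
    return \%index;
}

# Annotate all lines for one hash code; returns how many lines were touched.
sub annotate {
    my ($index, $hash, $pwd) = @_;
    my $refs = $index->{$hash} or return 0;
    $$_ =~ s/\Q$hash\E/$hash ( $pwd )/ for @$refs;
    return scalar @$refs;
}
```

Each lookup is a single hash access, so the loop over cracked.txt touches only the lines that actually need changing.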

On the other hand, a hash with 8E7 elements might not perform all that well. The answer lies in benchmarking.
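One way to run such a benchmark is the core Benchmark module. The sketch below compares a linear regex scan against a hash lookup on synthetic data; the "hashN:payload" line format, the data size, and the sub names are made up for illustration, so real numbers should come from your own files.

```perl
use strict; use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in for the real data (hypothetical format).
my @data = map { "hash$_:payload\n" } 1 .. 2000;

# Index: hash code => array of lines carrying it.
my %index;
push @{ $index{ (split /:/, $_, 2)[0] } }, $_ for @data;

# A handful of hash codes to look up in every benchmark round.
my @wanted = map { "hash" . (1 + int rand 2000) } 1 .. 50;

# Count matching lines by scanning every line with a regex.
sub count_linear { my ($h) = @_; return scalar grep { /^\Q$h\E:/ } @data }

# Count matching lines with one hash access.
sub count_hashed { my ($h) = @_; return scalar @{ $index{$h} || [] } }

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    linear => sub { count_linear($_) for @wanted },
    hashed => sub { count_hashed($_) for @wanted },
});
```

Scaling the 2000 up toward the real line count also shows whether memory, rather than lookup time, becomes the limit.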

score 0 · Accepted Answer

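When parsing my work logs, I do this: split the file into N parts (N = num_processors), aligning the split points to \n, then start N threads to process one part each. It works really fast, but the hard drive is the bottleneck.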
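The splitting step can be sketched as follows. This is only one possible reading of the description: the sub name chunk_ranges is hypothetical, and starting the workers is left out. It computes byte ranges whose boundaries fall right after a newline, so no line is cut in half.

```perl
use strict; use warnings;

# Split the file behind $fh into at most $n byte ranges [start, end),
# with every boundary placed just after a "\n".
sub chunk_ranges {
    my ($fh, $n) = @_;
    seek $fh, 0, 2 or die "seek failed: $!";      # 2 = SEEK_END
    my $size = tell $fh;
    my @starts = (0);
    for my $i (1 .. $n - 1) {
        my $guess = int($size * $i / $n);         # naive equal-size split point
        next if $guess <= $starts[-1];
        seek $fh, $guess, 0 or die "seek failed: $!";
        my $rest = <$fh>;                         # skip to the end of this line
        my $pos = tell $fh;
        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    my @ranges;
    for my $i (0 .. $#starts) {
        my $end = $i < $#starts ? $starts[$i + 1] : $size;
        push @ranges, [ $starts[$i], $end ];
    }
    return @ranges;
}
```

Each worker would then seek to its start offset and read lines until tell reaches its end offset. Fewer than N ranges come back when the file has too few lines to split.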
