shell - Optimizing multiple executions of grep

Question

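I have 3 files (links_file, my_links and my_queue), and I'm doing 3 things with links_file:

Remove lines with duplicate information (not the whole line is checked, only part of it, the variable img_url in the code below). The first line with a given img_url is kept
Remove lines whose img_url string is present in the my_links file
Remove lines whose img_url string is present in the my_queue file

I have working code, but with about 30,000 lines in links_file, 1,000 lines in my_links and 300 lines in my_queue it takes a long time (more than 10 minutes).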

function clean_file(){
    links_file="$1"
    my_links="$2"
    my_queue="$3"
    out_file="$4"

    rm -rf "$out_file"
    prev_url=""
    cat "$links_file" | while read line
    do
        img_url=$(echo $line | perl -pe 's/[ \t].*//g' | perl -pe 's/(.*)_.*/$1/g')
        # $links_file is sorted by img_url, so i can just check the previous value
        test "$prev_url" = "$img_url" && echo "duplicate: $img_url" && continue
        prev_url="$img_url"
        test $(grep "$img_url" "$my_links" | wc -l) -ne 0 && echo "in my_links: $img_url" && continue
        test $(grep "$img_url" "$my_queue" | wc -l) -ne 0 && echo "in my_queue: $img_url" && continue
        echo "$line" >> "$out_file"
    done
}

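I'm trying to optimize the code, but I'm out of ideas. My knowledge of perl is limited (I usually only use it for simple regex substitutions). Any help optimizing this would be much appreciated.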

score 1 · Accepted Answer

Let's take this one step at a time.

First, there is no need to call Perl twice. Instead of img_url=$(echo $line | perl -pe 's/[ \t].*//g' | perl -pe 's/(.*)_.*/$1/g'), you can do

img_url=$(echo $line | perl -pe 's/[ \t].*//g;s/(.*)_.*/$1/g')

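However, we can combine the two regexes into one:

s/.*_([^ \t]*).*/$1/

(this captures the run of non-blank characters after the last underscore on the line, whereas the two substitutions above keep the part of the first field before its last underscore; move the capture group if that is the key you need)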

Besides, Perl is overkill where sed is enough:

img_url=$(echo "$line" | sed 's/.*_\([^[:space:]]*\).*/\1/')
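
If the aim is simply fewer processes per line, the key can also be cut out by the shell itself, with no perl or sed at all. A minimal sketch using POSIX parameter expansion; it mirrors the two original substitutions from the question (first field, minus everything from its last underscore), and the sample line is made up for illustration:

```shell
# Made-up sample line; real lines are assumed to start with the URL field.
line='http://example.com/pic_001.jpg    some description'

first=${line%%[[:space:]]*}   # like s/[ \t].*//   : drop everything from the first blank
img_url=${first%_*}           # like s/(.*)_.*/$1/ : drop the last underscore and what follows

echo "$img_url"               # prints http://example.com/pic
```

Inside the while read loop these two assignments would take the place of the img_url=$(...) line, saving the command substitution and both perl processes on every line.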

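But hey, maybe Perl actually should be your tool of choice. You see, for every url you read, you read both files (queue and links) in full to find a matching line. If only there were a way to read them once and keep their contents in memory! Wait a minute. Yes, we can do that in bash. No, I don't want to do that :-)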
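
For what it's worth, a middle ground is to stay in the shell function and let a single awk process hold the lookup table. This is a sketch under assumptions, not a drop-in replacement: the key() helper is hypothetical and mirrors the two perl substitutions from the question, all three files are assumed to share the same layout, and keys are compared exactly instead of being grepped for as substrings.

```shell
#!/bin/sh
# Sketch only. Assumption: the key is the first blank-delimited field with
# its last "_suffix" removed, and all three files share that layout.

clean_file() {
    links_file=$1 my_links=$2 my_queue=$3 out_file=$4

    awk '
        # key(s): hypothetical helper mirroring the two perl substitutions
        function key(s) {
            sub(/[ \t].*/, "", s)    # drop everything from the first blank
            sub(/_[^_]*$/, "", s)    # drop the last underscore and its suffix
            return s
        }
        # Lookup files (every argument but the last): remember their keys.
        FILENAME != ARGV[ARGC - 1] { seen[key($0)] = 1; next }
        # Main file: print a line only the first time its key shows up.
        !seen[key($0)]++
    ' "$my_links" "$my_queue" "$links_file" > "$out_file"
}
```

It is called like the original, clean_file links_file my_links my_queue out_file. Each file is read exactly once, so the running time no longer grows with the product of the file sizes. Unlike the original loop, it also drops duplicates that are not on adjacent lines and does not print the "duplicate:" or "in my_links:" messages.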

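The Perl script below is neither particularly sophisticated nor particularly optimized, but it should be much faster than your approach. I tried to keep it easy to understand; in fact, above a certain level (and you are certainly at that level), Perl is much easier to write than bash.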

#!/usr/bin/perl

use strict   ;
use warnings ;

my $my_links = "my_links" ;
my $my_queue = "my_queue" ;
# define the regular expression to find the img_url
my $regex = '.*_([^\s]*).*' ;

my %links = geturls( $my_links ) ;
my %queue = geturls( $my_queue ) ;

# loop over STDIN trying to find the match

my %index ;
while( <STDIN> ) {
  next unless m/$regex/ ; # ignore lines that do not match
  next if( $links{$1} || $queue{$1} || $index{$1} ) ; 
  $index{$1}++ ; # index hash to eliminate duplicates
  print $_ ;
} 

# function to store the two files (my_links and my_queue) in the memory.
# we populate a hash with the img urls read.
sub geturls {

  my $fname = shift ;
  my %ret ; 

  open my $fh, '<', $fname or die "Cannot open $fname: $!" ;

  while( <$fh> ) {
    next unless m/$regex/  ; # ignore lines that do not match
    # $1 holds the subexpression within the parentheses
    $ret{$1}++ ; 
  } 

  return %ret ;
}

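The script removes all duplicates, even those that are not on consecutive lines; I hope you don't mind.

One caveat, though: I am assuming that all three files follow a similar structure. Next time you ask a question here, please provide sample files and the desired output.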
