bash - Optimization of a "search and grab" sh script

Question

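This script used to work well, but now that I am dealing with a large number of inodes (about 400K), it seems to cause some I/O slowness. The script reads a definition file "def", which is a list of identifiers. For each of the 400K files in the "dir" directory, if one of the identifiers is found in the first 4 lines, it appends the whole file content to the end of the output file specific to that "def".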

#!/bin/sh
for def in *.def
do
        touch $def.out
        for file in $dir/*
        do
                if head -4 $file | grep -q -f  $def
                then
                        cat $file >> $def.out
                fi
        done
done

How can I make it faster?

score 2 · Accepted Answer

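A Perl solution. It should be much faster than your script, because

It creates a regular expression from each .def file. It does not read each .def file multiple times.
It uses opendir to read the directory contents. This is much faster than doing a glob *, but as a penalty, the files are not sorted. To compare the output of your script and mine, you have to use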
```
diff <(sort $def.out) <(sort $def-new.out)
```
You can replace the opendir with a glob to get exactly the same output. That slows the script down, but it is still much faster than the old one.
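The first point, turning a list of identifiers into one pattern, can also be sketched in plain shell. This is only an illustration, not part of the Perl script: the file names `demo.def` and `sample.txt` are made up, and it assumes the identifiers contain no regex metacharacters, because `grep -E` would interpret them.

```shell
#!/bin/sh
# Sketch: join the identifiers of one .def file into a single alternation.
# "demo.def" and "sample.txt" are invented names used only for this demo.
work=$(mktemp -d)
printf '%s\n' alpha beta gamma > "$work/demo.def"
printf '%s\n' 'line one' 'has beta here' 'line three' > "$work/sample.txt"

# paste -s joins all lines of the file; -d '|' puts a bar between them.
regex=$(paste -sd '|' "$work/demo.def")
echo "$regex"

# Test only the first 4 lines, as the original script does.
if head -4 "$work/sample.txt" | grep -Eq "$regex"; then
    result=match
else
    result=nomatch
fi
echo "$result"
rm -rf "$work"
```

With the sample data above this prints `alpha|beta|gamma` and then `match`.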

Here is the script:

#!/usr/bin/perl
use warnings;
use strict;

my $dir = 'd';              # Enter your dir here.

my @regexen;
my @defs = glob '*.def';
for my $def (@defs) {
    open my $DEF,   '<', $def           or die "$def: $!";
    open my $TOUCH, '>', "$def-new.out" or die "$def-new.out: $!";
    my $regex = q();
    while (<$DEF>) {
        chomp;
        $regex .= "$_|"
    }
    substr $regex, -1, 1, q();
    push @regexen, qr/$regex/;
}

# If you want the same order, uncomment the following 2 lines and comment the next 2 ones.
#
# for my $file (glob "$dir/*") {
#     $file =~ s%.*/%%;

opendir my $DIR, $dir or die "$dir: $!";
while (my $file = readdir $DIR) {
    next unless -f "$dir/$file";

    my %matching_files;
    open my $FH, '<', "$dir/$file" or die "$dir/$file: $!";
    while (my $line = <$FH>) {
        last if $. > 4;
        my @matches = map $line =~ /$_/ ? 1 : 0, @regexen;
        $matching_files{$_}++ for grep $matches[$_], 0 .. $#defs;
    }

    for my $i (keys %matching_files) {
        open my $OUT, '>>', "$defs[$i]-new.out" or die "$defs[$i]-new.out: $!";
        open my $IN,  '<',  "$dir/$file"        or die "$dir/$file: $!";
        print $OUT $_ while <$IN>;
        close $OUT;
    }
}

Update

A file can now be grabbed more than once. Instead of building one huge regex, the script builds an array of regexes and matches them one by one.
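The same idea, reading each file's header once and checking it against every pattern list in turn, might look like this in shell. It is a rough sketch with invented file names, and it still starts one grep per .def per file, so it does not reproduce the speed of the Perl version.

```shell
#!/bin/sh
# Sketch: one header read per data file, checked against every .def list.
# All file names below are invented for the demo.
work=$(mktemp -d)
mkdir "$work/d"
printf '%s\n' alpha > "$work/a.def"
printf '%s\n' beta  > "$work/b.def"
printf '%s\n' 'alpha and beta' 'body' > "$work/d/f1"
printf '%s\n' 'only beta' 'body'      > "$work/d/f2"

cd "$work" || exit 1
for file in d/*; do
    [ -f "$file" ] || continue
    header=$(head -4 "$file")          # read the first 4 lines once
    for def in *.def; do
        # Each pattern list is tried separately, so one file
        # can be appended to several .out files.
        if printf '%s\n' "$header" | grep -q -f "$def"; then
            cat "$file" >> "$def.out"
        fi
    done
done

a_lines=$(wc -l < a.def.out | tr -d ' ')
b_lines=$(wc -l < b.def.out | tr -d ' ')
echo "a.def.out: $a_lines lines, b.def.out: $b_lines lines"
cd / && rm -rf "$work"
```

Here `f1` matches both lists and `f2` matches only `b.def`, so the script prints `a.def.out: 2 lines, b.def.out: 4 lines`.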

score 1 · Accepted Answer

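I have found that once I have more than 10,000 files in a single folder, I start to see performance problems. When that happens, even the ls command can take several seconds to return.

Your script appears to be inherently I/O heavy. It looks at a lot of files and creates or appends to a lot of files. Without changing how the script works, I don't see much that can be improved.

If you can, move some of this data into a database. A database can be tuned for this scale of data much more easily than a file system.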

score 0 · Accepted Answer

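You can save a lot of forks; a single fork saved inside the loop saves 400K forks over the whole script. Here is what I would do.

Instead of touching each *.def one by one, touch them in bulk: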

find . -name '*.def' | sed 's/\(.*\)/\1.out/' | xargs touch

(If your find supports it, use find . -maxdepth 1...)

Instead of a pipeline of two commands, do it in a single command:

if awk "NR <= 4 && /$def/ { found = 1; exit } NR == 5 { exit } END { exit !found }" "$file"; then

(Do check that $def does not contain metacharacters, though. Dots should be fine.)
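A fuller sketch of the single-awk check follows, wrapped in a hypothetical helper named `head_matches` (the name and the sample files are assumptions, not from the answer). It passes the pattern with `-v` instead of splicing it into the program text, and it sets the exit status in an `END` block so that a file shorter than five lines without a match also reports failure.

```shell
#!/bin/sh
# Sketch: one awk process decides whether a pattern occurs in the first 4 lines.
# head_matches PATTERN FILE returns 0 on a hit and 1 otherwise.
head_matches() {
    awk -v pat="$1" '
        NR <= 4 && $0 ~ pat { found = 1; exit }  # hit: stop reading
        NR == 5             { exit }             # past the header: stop reading
        END                 { exit !found }      # status 0 only on a hit
    ' "$2"
}

# Small wrapper so the demo can print a word instead of an exit status.
check() {
    if head_matches "$1" "$2"; then echo hit; else echo miss; fi
}

work=$(mktemp -d)
printf '%s\n' 'id: alpha' 'x' > "$work/short"           # 2 lines, contains alpha
printf '%s\n' 1 2 3 4 'alpha on line 5' > "$work/late"  # alpha only after line 4

r_short=$(check alpha "$work/short")
r_late=$(check alpha "$work/late")
r_none=$(check beta "$work/short")
echo "$r_short $r_late $r_none"
rm -rf "$work"
```

With these sample files the script prints `hit miss miss`. Note that awk processes backslash escapes in a `-v` assignment, so patterns containing backslashes would need extra care.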
