regex - Grepping a 20 GB file in bash

Question

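A question about code performance: I'm trying to run ~25 regex rules against a ~20 GB text file. The script should write the matches to text files; each regex rule produces its own output file. See the pseudocode below: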

regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
    while read line
    # Each $line in the looped-through file contains a regex rule, e.g.,
    # egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
    # $rname is a unique rule name generated by a separate bash function
    # exported to the current shell.
        do
        cmd="$line $tmp > ~/outputdir/$tmp.$rname.filter.piped &"
        eval $cmd
    done < $regex_rules
done

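A couple of thoughts:

Is there a way to loop over the text file only once, evaluating all rules and splitting the matches into individual files in a single pass? Would that be faster?
Should I be using a different tool for this job?

Thanks.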

score 5 · Accepted Answer

This is why grep has the -f option. Reduce your regexrulefile.txt to just the regexes, one per line, and run

egrep -f regexrulefile.txt the_big_file

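This produces all the matches in a single output stream, but you can loop over that output afterwards to separate them. Assuming the combined list of matches isn't huge, this will be a performance win.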
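A rough sketch of that two-pass idea, under some assumptions: the file names (rules.txt, big.txt, combined.txt, out.N) and the sample data are made up for illustration, and -i is added only because the rules in the question used egrep -i. The first pass scans the big file once; the second pass runs each rule over the much smaller combined file.

```shell
#!/usr/bin/env bash
# Sketch only: rules.txt, big.txt and the out.N names are made-up examples.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical rule file: one extended regex per line, no "egrep -i" prefix.
printf '%s\n' '(^| )justin ?bieber' '(^| )selena ?gomez' > rules.txt

# Tiny stand-in for the 20 GB input.
printf '%s\n' 'I saw Justin Bieber' 'nothing here' 'selenagomez on tour' > big.txt

# Pass 1: one scan of the big file, keeping lines that match any rule.
grep -E -i -f rules.txt big.txt > combined.txt

# Pass 2: split the (much smaller) combined file, one output per rule.
n=0
while IFS= read -r rule; do
    n=$((n + 1))
    # grep exits 1 when nothing matches; that is not an error here.
    grep -E -i -e "$rule" combined.txt > "out.$n" || true
done < rules.txt
```

Whether this wins depends on how small combined.txt ends up; if most lines of the input match some rule, the second pass costs almost as much as the original loop.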

score 2 · Accepted Answer

A quick (!= very fast) Perl solution:

#!/usr/bin/perl
use strict; use warnings;

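We preload the regexes so that their file is read only once. They are stored in the array @regex. The regex file is the first file given as an argument.

open REGEXES, '<', shift(@ARGV) or die;
# chomp each line so the trailing newline does not become part of the pattern
my @regex = map { chomp; qr/$_/ } <REGEXES>;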
# use the following if the file still includes the egrep:
# my @regex = map {
#     s/^egrep \s+ -i \s+ '? (.*?) '? \s* $/$1/x;
#     qr{$_}
# } <REGEXES>;
close REGEXES or die;

We iterate over each remaining file given as an argument:

while (@ARGV) {
  my $filename = shift @ARGV;

We open the output files up front for efficiency:

  my @outfile = map {
     open my $fh, '>', "outdir/$filename.$_.filter.piped"
       or die "Couldn't open outfile for $filename, rule #$_";
     $fh;
  } (1 .. scalar(@regex));
  open BIGFILE, '<', $filename or die;

We print every line that matches a rule to that rule's output file.

  while (not eof BIGFILE) {
    my $line = <BIGFILE>;
    for my $ruleNo (0..$#regex) {
      # a filehandle stored in an array element must be wrapped in a block
      print {$outfile[$ruleNo]} $line if $line =~ $regex[$ruleNo];
      # if only the first match is interesting:
      # if ($line =~ $regex[$ruleNo]) {
      #     print {$outfile[$ruleNo]} $line;
      #     last;
      # }
    }
  }

Clean up before the next iteration:

  foreach (@outfile) {
    close $_ or die;
  }
  close BIGFILE or die;
}

print "Done";

Invocation: $ perl ultragrepper.pl regexFile bigFile1 bigFile2 bigFile3 and so on. Anything faster would have to be written directly in C. Your hard drive's data transfer rate is the limit.

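This should run faster than the bash counterpart, because I avoid reopening files and reparsing the regexes. It also doesn't have to spawn new processes for external tools. But we could spawn several workers in parallel! (At least NumOfProcessors * 2 might be sensible.) Note that fork creates processes rather than threads:

my @kids;
while (@ARGV) {
    my $filename = shift @ARGV;   # shift before forking so the parent's loop terminates
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid) { push @kids, $pid; next; }   # parent: move on to the next file
    ...;       # child: process $filename as in the loop body above
    exit 0;    # child is done; never fall through and fork again
}
waitpid($_, 0) for @kids;   # parent: wait for all children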
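The same one-worker-per-input-file idea can be sketched in bash with background jobs. This is only an illustration under assumptions: filter_one is a hypothetical stand-in for the real per-file work (for example, the Perl script above), and the file names and the 'foo' pattern are invented.

```shell
#!/usr/bin/env bash
# Sketch only: filter_one is a hypothetical stand-in for the per-file work.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' 'foo 1' 'bar 2' > a.txt
printf '%s\n' 'bar 3' 'foo 4' > b.txt

# Stand-in worker: in practice this would be a single-pass filter
# applied to one input file.
filter_one() {
    grep -E 'foo' "$1" > "$1.foo.out" || true
}

pids=""
for f in a.txt b.txt; do
    filter_one "$f" &        # run each file's scan as its own process
    pids="$pids $!"
done

# Block until every worker has finished before using the results.
for pid in $pids; do
    wait "$pid"
done
```

Since the disk is the stated bottleneck, parallel workers reading from the same drive may not help much; they are more likely to pay off when the inputs sit on different devices or the matching is CPU-bound.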

score 2 · Accepted Answer

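I've done something similar with lex. Granted, it ran every other day, so YMMV. It is very fast, even on files of several hundred megabytes on a remote Windows share. Processing takes only a few seconds. I don't know how comfortable you are hacking up a quick C program, but I've found this to be the fastest and easiest solution for large-scale regex problems.

Parts redacted to protect the guilty: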

    /************************************************** 
        start of definitions section

    ***************************************************/


%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <getopt.h>
#include <errno.h>

char inputName[256];
// static insert variables

//other variables
char tempString[256];
char myHolder[256];
char fileName[256];
char unknownFileName[256];
char stuffFileName[256];
char buffer[5];

/* we are using pointers to hold the file locations, and allow us to dynamically open and close new files */
/* also, it allows us to obfuscate which file we are writing to, otherwise this couldn't be done */

FILE *yyTemp;
FILE *yyUnknown;
FILE *yyStuff;

// flags for command line options
static int help_flag = 0;

// number of lines routed to the "unknown" file (used in the \n rule)
static int unknownCount = 0;

%}

%option 8bit 
%option nounput nomain noyywrap 
%option warn

%%
    /************************************************ 
        start of rules section
    *************************************************/


(\"A\",\"(1330|1005|1410|1170)\") { 
    strcat(myHolder, yytext);
    yyTemp = &(*yyStuff);
} //stuff files

. { strcat(myHolder, yytext); }

\n  {
    if (&(*yyTemp) == &(*yyUnknown))
        unknownCount += 1;
    strcat(myHolder, yytext); 
    //print to file we are pointing at, whatever it is
    fprintf(yyTemp, "%s", myHolder);
    strcpy(myHolder, "");
    yyTemp = &(*yyUnknown);
}

<<EOF>> {
    strcat(myHolder, yytext); 
    fprintf(yyTemp, "%s", myHolder);
    strcpy(myHolder, "");
    yyTemp = &(*yyUnknown);

    yyterminate();
}

%%
    /**************************************************** 
        start of code section


    *****************************************************/


int main(int argc, char **argv);

int main(int argc, char **argv)
{
    /****************************************************
        The main method drives the program. It gets the filename from the
        command line, and opens the initial files to write to. Then it calls the lexer.
        After the lexer returns, the main method finishes out the report file,
        closes all of the open files, and prints out to the command line to let the
        user know it is finished.
    ****************************************************/

    int c;

    // the gnu getopt library is used to parse the command line for flags
    // afterwards, the final option is assumed to be the input file

    while (1) {
        static struct option long_options[] = {
            /* These options set a flag. */
            {"help",   no_argument,     &help_flag, 1},
            /* These options don't set a flag. We distinguish them by their indices. */
            {0, 0, 0, 0}
        };
           /* getopt_long stores the option index here. */
        int option_index = 0;
        c = getopt_long (argc, argv, "h",
            long_options, &option_index);

        /* Detect the end of the options. */
        if (c == -1)
            break;

        switch (c) {
            case 0:
               /* If this option set a flag, do nothing else now. */
               if (long_options[option_index].flag != 0)
                 break;
               printf ("option %s", long_options[option_index].name);
               if (optarg)
                 printf (" with arg %s", optarg);
               printf ("\n");
               break;

            case 'h':
                help_flag = 1;
                break;

            case '?':
               /* getopt_long already printed an error message. */
               break;

            default:
               abort ();
            }
    }

    if (help_flag == 1) {
        printf("proper syntax is: yourProgram.exe [OPTIONS]... INFILE\n");
        printf("splits csv file into multiple files\n");
        printf("Option list: \n");
        printf("--help                  print help to screen\n");
        printf("\n");
        return 0;
    }

    //get the filename off the command line and redirect it to input
    //if there is no filename then use stdin

    if (optind < argc) {
        FILE *file;

        file = fopen(argv[optind], "r");
        if (!file) {
            fprintf (stderr, "%s: Couldn't open file %s; %s\n", argv[0], argv[optind], strerror (errno));
            exit(errno);
        }
        yyin = file;
        strcpy(inputName, argv[optind]);
    }
    else {
        printf("no input file set, using stdin. Press ctrl-c to quit");
        yyin = stdin;
        strcpy(inputName, "\b\b\b\b\bagainst stdin");
    }

    //set up initial file names

    strcpy(fileName, inputName);
    strncpy(unknownFileName, fileName, strlen(fileName)-4);
    strncpy(stuffFileName, fileName, strlen(fileName)-4);

    strcat(unknownFileName, "_UNKNOWN_1.csv");
    strcat(stuffFileName, "_STUFF_1.csv");

    //open files for writing

    yyout = stdout;
    // yyTemp only aliases one of the output streams, so it needs no allocation
    yyUnknown = fopen(unknownFileName,"w");
    yyTemp = yyUnknown;

    yyStuff = fopen(stuffFileName,"w");

    yylex();

    //close open files

    fclose(yyUnknown);
    fclose(yyStuff);

    printf("Lexer finished running %s",fileName);

    return 0;

}

To build this flex program, install flex and use this makefile (adjust the paths):

TARGET = project.exe
TESTBUILD = project
LEX = flex
LFLAGS = -Cf
CC = i586-mingw32msvc-gcc
CFLAGS = -O -Wall 
INSTALLDIR = /mnt/J/Systems/executables

.PHONY: default all clean install uninstall cleanall

default: $(TARGET)

all: default install

OBJECTS = $(patsubst %.l, %.c, $(wildcard *.l))

# NOTE: make requires each recipe line below to start with a tab character
%.c: %.l
	$(LEX) $(LFLAGS) -o $@ $<

.PRECIOUS: $(TARGET) $(OBJECTS)

$(TARGET): $(OBJECTS)
	$(CC) $(OBJECTS) $(CFLAGS) -o $@

linux: $(OBJECTS)
	gcc $(OBJECTS) $(CFLAGS) -lm -g -o $(TESTBUILD)

cleanall: clean uninstall

clean:
	-rm -f *.c
	-rm -f $(TARGET)
	-rm -f $(TESTBUILD)

uninstall:
	-rm -f $(INSTALLDIR)/$(TARGET)

install:
	cp -f $(TARGET) $(INSTALLDIR)

score 1 · Accepted Answer

Invert the structure: read the file in, then loop over the rules, so that you only perform the matching on individual lines.

regex_rules=~/Documents/rulesfiles/regexrulefile.txt
for tmp in *.unique20gbfile.suffix; do
    while read -r line; do
        while read -r rule
        # Each $rule in the looped-through file contains a regex rule, e.g.,
        # egrep -i '(^| )justin ?bieber|(^| )selena ?gomez'
        # $rname is a unique rule name generated by a separate bash function
        # exported to the current shell.
        do
            # \$line is expanded by eval, inside quotes, so the input text is not re-parsed as shell code
            cmd="printf '%s\n' \"\$line\" | $rule >> ~/outputdir/$tmp.$rname.filter.piped &"
            eval "$cmd"
        done < "$regex_rules"
    done < "$tmp"
done

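At this point, though, you could/should use bash's (or perl's) built-in regex matching rather than having it spawn a separate egrep process for every match. You might also be able to split the file and run parallel processes. (Note that I also corrected > to >>.)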
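A minimal sketch of the built-in matching variant, assuming bash (for [[ ... =~ ... ]] and arrays). The rule names, patterns, and file names are invented for illustration, and the rules are assumed to have been reduced to bare extended regexes; nocasematch stands in for egrep -i.

```shell
#!/usr/bin/env bash
# Sketch only: rule names, patterns, and file names are made-up examples.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' 'I saw Justin Bieber' 'nothing here' 'selenagomez on tour' > big.txt

# Hypothetical rules: parallel arrays of rule names and ERE patterns.
names=(bieber gomez)
patterns=('(^| )justin ?bieber' '(^| )selena ?gomez')

shopt -s nocasematch     # case-insensitive =~, like egrep -i

# Single pass over the input; every rule is tested in-process.
while IFS= read -r line; do
    for i in "${!patterns[@]}"; do
        # The pattern variable is left unquoted so it is treated as a regex.
        if [[ $line =~ ${patterns[i]} ]]; then
            printf '%s\n' "$line" >> "big.txt.${names[i]}.filter.piped"
        fi
    done
done < big.txt
```

This removes the per-line process spawn, but a bash read loop is itself slow, so on a 20 GB input it will likely still lag well behind grep -f or a Perl single pass.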

score 1 · Accepted Answer

I also decided to come back here and write a perl version, only to notice that amon had already done so. Since it's already written, here is mine:

#!/usr/bin/perl -W
use strict;

# The search spec file consists of lines arranged in pairs like this:
# file1
# [Ff]oo
# file2
# [Bb]ar
# The first line of each pair is an output file. The second line is a perl
# regular expression. Every line of the input file is tested against all of
# the regular expressions, so an input line can end up in more than one
# output file if it matches more than one of them.

sub usage
{
        die "Usage: $0 search_spec_file [inputfile...]\n";
}

@ARGV or usage();

my @spec;

my $specfile = shift();
open my $spec, '<', $specfile or die "$specfile: $!\n";
while(<$spec>) {
        chomp;
        my $outfile = $_;
        my $regexp = <$spec>;
        # check for a missing second line before chomp touches an undefined value
        defined($regexp) or die "$specfile: Invalid: Odd number of lines\n";
        chomp $regexp;
        open my $out, '>', $outfile or die "$outfile: $!\n";
        push @spec, [$out, qr/$regexp/];
}
close $spec;

while(<>) {
        for my $spec (@spec) {
                my ($out, $regexp) = @$spec;
                print $out $_ if /$regexp/;
        }
}
