regex - How can I use regular expressions (in Perl or a *nix terminal) to match words from a list in a huge corpus?

Question

Given a list of nouns in a .txt file, with one noun per line, for example:

hooligan
football
brother
bollocks

...and a separate .txt file containing a list of regular expressions, one per line, like this:

[a-z]+\tNN(S)?
[a-z]+\tJJ(S)?

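...I would like to run the regular expressions over each sentence of the corpus. Every time a regular expression matches a pattern that contains one of the nouns from the noun list, I want to print that noun and (separated by a tab) the regular expression that matched it. Here is an example of the resulting output: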

football    [a-z]+NN(S)?\'s POS[a-z]+NN(S)?
hooligan    [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
hooligan    [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
football    [a-z]+NN(S)?[a-z]+NN(S)?
brother [a-z]+PP$[a-z]+NN(S)?
bollocks    [a-z]+DT[a-z]+NN(S)?
football    [a-z]+NN(s)?(be)VBZnotRB

The corpus I will be using is large (tens of GB) and has the following format (each sentence is enclosed in <s> tags):

<s>
Hooligans   hooligan    NNS 1   4   NMOD
,   ,   ,   2   4   P
unbridled   unbridled   JJ  3   4   NMOD
passion passion NN  4   0   ROOT
-   -   :   5   4   P
and and CC  6   4   CC
no  no  DT  7   9   NMOD
executive   executive   JJ  8   9   NMOD
boxes   box NNS 9   4   COORD
.   .   SENT    10  0   ROOT
</s>
<s>
Hooligans   hooligan    NNS 1   4   NMOD
,   ,   ,   2   4   P
unbridled   unbridled   JJ  3   4   NMOD
passion passion NN  4   0   ROOT
-   -   :   5   4   P
and and CC  6   4   CC
no  no  DT  7   9   NMOD
executive   executive   JJ  8   9   NMOD
boxes   box NNS 9   4   COORD
.   .   SENT    10  0   ROOT
</s>
<s>
Portsmouth  Portsmouth  NP  1   2   SBJ
bring   bring   VVP 2   0   ROOT
something   something   NN  3   2   OBJ
entirely    entirely    RB  4   5   AMOD
different   different   JJ  5   3   NMOD
to  to  TO  6   5   AMOD
the the DT  7   12  NMOD
Premiership Premiership NP  8   12  NMOD
:   :   :   9   12  P
football    football    NN  10  12  NMOD
's  's  POS 11  10  NMOD
past    past    NN  12  6   PMOD
.   .   SENT    13  2   P
</s>
<s>
This    this    DT  1   2   SBJ
is  be  VBZ 2   0   ROOT
one one CD  3   2   PRD
of  of  IN  4   3   NMOD
Britain Britain NP  5   10  NMOD
's  's  POS 6   5   NMOD
most    most    RBS 7   8   AMOD
ardent  ardent  JJ  8   10  NMOD
football    football    NN  9   10  NMOD
cities  city    NNS 10  4   PMOD
:   :   :   11  2   P
think   think   VVP 12  2   COORD
Liverpool   Liverpool   NP  13  0   ROOT
or  or  CC  14  13  CC
Newcastle   Newcastle   NP  15  19  SBJ
in  in  IN  16  15  ADV
miniature   miniature   NN  17  16  PMOD
,   ,   ,   18  15  P
wound   wind    VVD 19  13  COORD
back    back    RB  20  19  ADV
three   three   CD  21  22  NMOD
decades decade  NNS 22  19  OBJ
.   .   SENT    23  2   P
</s>

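I started writing a script in Perl to achieve my goal. To avoid running out of memory with such a large dataset, I used the module Tie::File so that my script reads one line at a time (instead of trying to load the whole corpus file into memory). This would work perfectly with a corpus that has one sentence per line, but not in the current case, where each sentence is spread over several lines and delimited by tags.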

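Is there a way to achieve what I want with a combination of Unix terminal commands (e.g. cat and grep)? Otherwise, what would be the best solution to this problem? (Some code examples would be great.)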

score 3 · Accepted Answer

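A simple regex alternation is enough to pull the matching entries from your list of nouns, and Regexp::Assemble can handle the requirement of identifying which pattern from the other file matched. And, as Jonathan Leffler mentioned in his comment, setting the input record separator lets you read one record at a time, even when each record spans multiple lines.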
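To illustrate the record separator point in isolation, here is a minimal sketch. It assumes the closing tag always sits on its own line, so that "</s>\n" can serve as the separator; the two-sentence sample and the names $sample and @sentences are made up for the illustration, and it reads from an in-memory filehandle where a real script would open the corpus file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical two-sentence sample in the corpus format (tab-separated columns).
my $sample = join '',
    "<s>\n", "Hooligans\thooligan\tNNS\n", "passion\tpassion\tNN\n", "</s>\n",
    "<s>\n", "football\tfootball\tNN\n", "'s\t's\tPOS\n", "</s>\n";

# In-memory filehandle for the sketch; a real script would open the corpus file.
open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";

my @sentences;
{
    # Each read now returns everything up to and including the next "</s>\n".
    local $/ = "</s>\n";
    while (my $record = <$fh>) {
        chomp $record;                  # chomp strips the current $/, i.e. the closing tag
        $record =~ s/\A\s*<s>\n//;      # drop the opening tag
        push @sentences, $record;
    }
}
close $fh;

printf "%d sentences read\n", scalar @sentences;
print "First sentence:\n$sentences[0]";
```

Only one sentence is held in memory at a time, which is what matters for a corpus of tens of GB.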

Putting all of this together into a running example, we get:

#!/usr/bin/env perl    

use strict;
use warnings;
use 5.010;

use Regexp::Assemble;

my @nouns = qw( hooligan football brother bollocks );
my @patterns = ('[a-z]+\s+NN(S)?', '[a-z]+\s+JJ(S)?');

my $name_re = '(' . join('|', @nouns) . ')'; # Assumes no regex metacharacters

my $ra = Regexp::Assemble->new(track => 1);
$ra->add(@patterns);

local $/ = '<s>';

while (my $line = <DATA>) {
  my $match = $ra->match($line);
  next unless defined $match;

  while ($line =~ /$name_re/g) {
    say "$1\t\t$match";
  }
}


__DATA__
...

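...where the __DATA__ section contains the sample corpus provided in the original question. I have not included it here, to keep the answer compact. Note also that, in both patterns, I changed \t to \s+; this is because the tabs were not preserved when I copied and pasted your sample corpus.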

Running that code, I get the output:

hooligan        [a-z]+\s+NN(S)?
hooligan        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+JJ(S)?
football        [a-z]+\s+JJ(S)?

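Edit: Corrected the regexes. I had originally replaced \t with \s, which made them match NN or JJ only when preceded by exactly one whitespace character. They now also match multiple whitespace characters, which better emulates the original \t.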
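The difference between the three separators can be checked with a small sketch. The sample lines and the hash names (%line, %pattern, %result) are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical lemma/tag lines: one tab, one space, and several spaces as separators.
my %line = (
    tab    => "football\tNN",
    space  => "football NN",
    spaces => "football    NN",
);

# The three variants of the first pattern discussed above.
my %pattern = (
    '\t'  => qr/[a-z]+\tNN(S)?/,
    '\s'  => qr/[a-z]+\sNN(S)?/,
    '\s+' => qr/[a-z]+\s+NN(S)?/,
);

# Try every pattern against every line and record whether it matched.
my %result;
for my $p (sort keys %pattern) {
    for my $l (sort keys %line) {
        $result{$p}{$l} = $line{$l} =~ $pattern{$p} ? 1 : 0;
        printf "%-4s vs %-6s : %s\n", $p, $l, $result{$p}{$l} ? 'match' : 'no match';
    }
}
```

Only \s+ matches all three lines. \s accepts a tab or a single space but fails on a run of spaces, and \t fails as soon as the tabs have been converted to spaces. If the real corpus still has its tabs, the original \t patterns can be used unchanged.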

score 1 · Accepted Answer

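I ended up writing some quick code to solve my problem. I used Tie::File to handle the huge text dataset and specified </s> as the record separator, as suggested by Jonathan Leffler (the solution proposed by Dave Sherohman seems very elegant, but I could not try it). After separating the sentences, I isolate the columns I need (the 2nd and 3rd) and run the regular expressions. Before printing the output, I check whether the matched word is present in my word list: if it is not, it is excluded from the output.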

I am sharing my code here (comments included) in case someone else needs something similar.

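It is a bit dirty and can definitely be optimized, but it works for me and it supports very large corpora (I tested it with a 10 GB corpus: it completed successfully in a few hours).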

use strict;
use Tie::File; #This module makes a file look like a Perl array, each array element corresponds to a line of the file.

if ($#ARGV < 0 ) {  print "Usage: perl albzcount.pl corpusfile\n"; exit; }

#read nouns list (.txt file with one word per line - line breaks LF)
my $nouns_list = "nouns.txt";
open(my $nouns_fh, '<', $nouns_list) or die("Could not open the config file $nouns_list or file doesn't exist! $!");
my @nouns_contained_in_list = <$nouns_fh>;
close($nouns_fh);

# Reading regexp list (.txt file with one regexp per line - line breaks LF)
my $regex_list = "regexp.txt";
open(my $regex_fh, '<', $regex_list) or die("Could not open the config file $regex_list or file doesn't exist! $!");
my @regexps_contained_in_list = <$regex_fh>;
close($regex_fh);

# Reading Corpus File (each sentence is spread over several lines and enclosed in <s>...</s> tags)
my $corpusfile = $ARGV[0]; #Corpus filename (passed as an argument through the command)

# With TIE I don't load the entire file in an array. Perl thinks it's an array but the file is actually read record by record
# This is the key to manipulate huge text files without running out of memory
tie my @raw_corpus_data, 'Tie::File', $corpusfile,  recsep => '</s>' or die "Can't read file: $!\n";

# START go through the sentences of the corpus (each spread over multiple lines, one record per </s>), one by one
foreach my $corpus_record (@raw_corpus_data) {

    # Take a single sentence (which is spread over several lines).
    # NB each line contains "columns" separated by tabs
    my @corpus_sublines = split(/\n/, $corpus_record);

    # Declare the rebuilt sentence. Values are appended to it below
    my $corpus_line = '';

    # For each line that makes up a sentence
    foreach my $sentence_newline (@corpus_sublines) {

        # Explode by tab (column separator)
        my @corpus_columns = split(/\t/, $sentence_newline);

        # Put together new sentences using just columns 2 and 3 (lemma and tag) of each original line
        $corpus_line .= "$corpus_columns[1]\t$corpus_columns[2]\n";

        # ... Now the corpus has the format I want and can be processed
    }

    # For each regex
    foreach my $single_regexp (@regexps_contained_in_list) {

        # Remove the new lines (both \n and \r - depending on the OS) from the regexp read from the file.
        # Without this, the regular expressions read from the file don't always work.
        $single_regexp =~ s/\r|\n//g;

        # If the sentence analyzed in this cycle matches the regexp
        if ($corpus_line =~ m/$single_regexp/) {

            # Explode the matched text by tab so the first word, $onematch[0], can be isolated.
            # $& is the entire matched string
            my @onematch = split(/\t/, $&);

            # OUTPUT RESULTS
            # If the matched noun is not empty and it is part of the word list
            # (\Q...\E keeps any regex metacharacters in the matched word literal)
            if ($onematch[0] ne "" && grep( /^\Q$onematch[0]\E$/, @nouns_contained_in_list )) {
                print "$onematch[0]\t$single_regexp\n";
            } # END OUTPUT RESULTS
        } # END if the sentence analyzed in this cycle matches the regexp
    } # END foreach regex
} # END go through the sentences of the corpus, one by one

# Untie the source corpus file
untie @raw_corpus_data;
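
The steps described in this answer (split into sentences, keep columns 2 and 3, run each regex, keep only words from the list) can also be sketched without Tie::File. This is a variant under stated assumptions, not the code I ran on the 10 GB corpus: it reads records with $/ set to "</s>\n", looks nouns up in a hash instead of using grep, and uses /g so that every match in a sentence is reported rather than only the first one per regex. The inline noun list, patterns, and one-sentence corpus are hypothetical stand-ins for nouns.txt, regexp.txt, and the corpus file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical inputs; a real script would read these from nouns.txt and regexp.txt.
my %is_noun  = map { $_ => 1 } qw(hooligan football brother bollocks);
my @patterns = ('[a-z]+\tNN(S)?', '[a-z]+\tJJ(S)?');
my %compiled = map { $_ => qr/$_/ } @patterns;    # compile each pattern once

# Hypothetical one-sentence corpus, tab-separated as in the question.
my $corpus = "<s>\nHooligans\thooligan\tNNS\t1\t4\tNMOD\n"
           . "unbridled\tunbridled\tJJ\t3\t4\tNMOD\n</s>\n";
open(my $fh, '<', \$corpus) or die "Cannot open sample corpus: $!";

my @hits;
{
    local $/ = "</s>\n";    # one sentence per read
    while (my $record = <$fh>) {
        # Rebuild the sentence from columns 2 and 3 (lemma and tag) only.
        my $sentence = '';
        for my $row (split /\n/, $record) {
            my @col = split /\t/, $row;
            next if @col < 3;    # skips the <s> and </s> tag lines
            $sentence .= "$col[1]\t$col[2]\n";
        }
        for my $pattern (@patterns) {
            my $re = $compiled{$pattern};
            # /g walks through every match in the sentence, not only the first one.
            while ($sentence =~ /($re)/g) {
                my ($lemma) = split /\t/, $1;
                push @hits, "$lemma\t$pattern" if $is_noun{$lemma};
            }
        }
    }
}
close $fh;
print "$_\n" for @hits;
```

The hash lookup avoids scanning the whole noun list for every match and sidesteps any regex metacharacters in the words.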
