regex - 正则表达式代码

Question

我需要在两个制表符分隔的文件之间找到匹配项，如下所示：

文件 1：

ID1  1  65383896    65383896    G   C  PCNXL3
ID1  2  56788990        55678900        T       A  ACT1 
ID1  1   56788990       55678900       T       A  PRO55

文件 2

ID2 34    65383896   65383896       G   C  MET5
ID2  2   56788990       55678900       T       A  ACT1 
ID2  2   56788990       55678900       T       A  HLA

我想做的是检索两个文件之间的匹配行。我想匹配的是基因ID之后的一切

到目前为止，我已经编写了这段代码，但不幸的是 perl 一直给我错误：使用“在模式匹配中使用未初始化的值 (m//)”

你能帮我弄清楚我在哪里做错了吗？

先感谢您！

use strict;

open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";

my @sample1 = <INA>;
my @sample2 = <INB>;

foreach my $line (@sample1) {
    my @tab = split (/\t/, $line);

    my $chr   = $tab[1];
    my $start = $tab[2];
    my $end   = $tab[3];
    my $ref   = $tab[4];
    my $alt   = $tab[5];
    my $name  = $tab[6];

    foreach my $item (@sample2){
        my @fields = split (/\t/,$item);

        if (   $fields[1] =~ m/$chr(.*)/
            && $fields[2] =~ m/$start(.*)/
            && $fields[4] =~ m/$ref(.*)/
            && $fields[5] =~ m/$alt(.*)/
            && $fields[6] =~ m/$name(.*)/
        ) {     
            print  $line, "\n", $item;
        }
    }
}

score 1 · Accepted Answer

从表面上看，您的代码似乎很好（尽管我没有调试它）。如果您没有我无法发现的错误，可能是输入数据具有 RE 特殊字符，当您按原样放置时会混淆正则表达式引擎（例如，如果任何变量具有“$”字符） . 也可能是你在某些地方有空格而不是制表符，在这种情况下你确实会得到一个错误，因为你的拆分会失败。

无论如何，最好只编写一个包含所有字段的正则表达式。我下面的代码有点 Perl 惯用语。我喜欢使用隐式 $_ 在我看来这使代码更具可读性。我刚刚用您的输入文件对其进行了测试，它完成了这项工作。

use strict;

open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";

my @sample1 = <INA>;
my @sample2 = <INB>;


foreach (@sample1) {
    (my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
        m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
    my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
    #print "$rex\n";
    foreach (@sample2) {
        if( m/$rex/ ) {
            print "$id - $_";
        }
    }
}

另外，输入数据的规律性如何？字段之间是否只有一个选项卡？如果是这种情况，将行拆分为 7 个不同的字段是没有意义的 - 您只需要两个：行的 ID 部分，以及其余部分。第一个正则表达式是

(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;

您正在使用与上述类似的技术在第二个文件中搜索 $restOfLine。

如果您的文件很大并且性能是一个问题，您应该考虑将第一个正则表达式（或字符串）放入映射中。这将为您提供 O(n*log(m)) 其中 n 和 m 是每个文件中的行数。

最后，当我需要比较日志时，我遇到了类似的挑战。日志应该是相同的，除了每行开头的时间标记。但更重要的是：大多数行都是相同且有序的。如果这是您所拥有的，并且对您有意义，您可以：

首先从每一行中删除 IDxxx：perl -pe "s/ID\d+ +//" file >cleanfile
然后使用 BeyondCompare 或 Windiff 比较文件。

score 1 · Accepted Answer

我玩了一下你的代码。你在那里写的实际上是三个循环：

超过第一个文件的行，
超过第二个文件的行，以及
这些行中的所有字段。您手动展开了此循环。

该答案的其余部分假定文件是严格用制表符分隔的，并且任何其他空格都很重要（即使在字段和行的末尾）。

这是代码的精简版本（假设打开文件句柄$file1、$file2和use strict）：

my @sample2 = <$file2>;

SAMPLE_1:
foreach my $s1 (<$file1>) {
    my (undef, @fields1) = split /\t/, $s1;
    my @regexens = map qr{\Q$_\E(.*)}, @fields1;

    SAMPLE_2:
    foreach my $s2 (@sample2) {
        my (undef, @fields2) = split /\t/, $s2;
        for my $i (0 .. $#regexens) {
            $fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
        }
        # only gets here if all regexes matched
        print $s1, $s2;
    }
}

我做了一些优化：预编译各种正则表达式并将它们存储在一个数组中，引用字段的内容等。但是，这个算法是O(n²)，这很糟糕。

这是该算法的一个优雅变体，它知道只有第一个字段不同——该行的其余部分必须与character 的字符相同：

my @sample2 = <$file2>;

foreach my $s1 (<$file1>) {
    foreach my $s2 (@sample2) {
        print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
    }
}

我只是测试该行其余部分的字符串相等性。虽然这个算法仍然是O(n²)，但它仅仅通过在这里避免脑死亡的正则表达式，它就比第一个解决方案大约好一个数量级。

最后，这是一个O(n)的解决方案。它是前一个的变体，但在一个之后执行循环，而不是在彼此内部，因此在线性时间内完成。我们使用哈希：

# first loop via map    
my %seen = map {reverse(split /\t/, $_, 2)}
           # map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
           <$file1>;

# 2nd loop
foreach my $line (<$file2>) {
    my ($id2, $key) = split /\t/, $line, 2;
    if (defined (my $id1 = $seen{$key})) {
        print "$id1\t$key";
        print "$id2\t$key";
    }
}

%seen是一个哈希，该行的其余部分作为键，第一个字段作为值。在第二个循环中，我们再次检索该行的其余部分。如果该行存在于第一个文件中，我们将重建整行并将其打印出来。该解决方案比其他解决方案更好，并且由于其线性复杂性而可以很好地向上和向下扩展

score 0 · Accepted Answer

怎么样：

#!/usr/bin/perl

use File::Slurp;
use strict;

my ($ina, $inb) = @ARGV;

my @lines_a = File::Slurp::read_file($ina);
my @lines_b = File::Slurp::read_file($inb);

my $table_b = {};

my $ln = 0;

# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists  containing the id and line number
foreach (@lines_b) {
    chomp; # strip newlines
    $ln++; # count current line number 
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
    if (exists $table_b->{$rest}) {
        push @{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
    } else {
        $table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
    }
} 

# Go thru first file and print out all matches we might have
$ln = 0;
foreach (@lines_a) {
    chomp;
    $ln++; 
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
    if (exists $table_b->{$rest}) { # if we have this entry print where it is found
        print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } @{ $table_b->{$rest} }) . "\n";
    }
}

regex - 正则表达式代码

3 回答 3

Related

Reference