regex - Regular expression to match equal diff lines

Question

I have a diff that I post-process, and I want to squash the lines that are equal. Here is an example:

Foo
-Bar
+Bar
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n
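For illustration, here is a minimal Python sketch of how that single-line pattern might be applied. The function name `squash_single_pairs` is made up for this example, and two choices are assumptions rather than part of the question: the pattern is anchored with `^` so it only matches at line starts, and a matched pair is turned into a context line (you could equally replace it with an empty string).

```python
import re

# The pattern from the question, anchored so it only matches at the start of a line.
EQUAL_PAIR = re.compile(r"^-(.*)\n\+\1\n", re.MULTILINE)


def squash_single_pairs(diff_text):
    """Turn each '-X' line immediately followed by '+X' into a context line ' X'."""
    return EQUAL_PAIR.sub(r" \1\n", diff_text)


if __name__ == "__main__":
    print(squash_single_pairs(" Foo\n-Bar\n+Bar\n Baz\n"), end="")
```

A block such as `-Foo`, `-Bar`, `+Foo`, `+Bar` is left untouched by this pattern, because `-Foo` is not directly followed by `+Foo`.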

The problems start when I have multi-line matches like:

-Foo
-Bar
+Foo
+Bar

Any ideas? Or should I skip the regex and write a simple parser? Or does one already exist?

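In case there is a better solution, here is some backstory. I am comparing two files to see whether they are the same. Sadly, the output is almost the same but needs some post-processing, for example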

-on line %d
+on line 8

So I go through and convert known strings into other known strings, and then I check whether the diff is empty or still shows differences.
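One possible alternative, sketched here in Python, is to normalise the known variable parts in both inputs before diffing, so that nothing has to be squashed afterwards. This is not what the question does; it is only an assumption about what might fit the backstory. The names `normalise` and `remaining_diff`, and the rule that both `%d` and any run of digits map to `##`, are hypothetical.

```python
import difflib
import re


def normalise(line):
    """Map the known variable parts onto one placeholder (hypothetical rule)."""
    return re.sub(r"%d|\d+", "##", line)


def remaining_diff(lines_a, lines_b):
    """Return the unified diff of the normalised inputs; an empty list means 'same'."""
    a = [normalise(line) for line in lines_a]
    b = [normalise(line) for line in lines_b]
    return list(difflib.unified_diff(a, b, "expected", "actual"))
```

With this, `remaining_diff(["on line %d\n"], ["on line 8\n"])` comes back empty, while inputs that differ in anything other than the masked parts still produce diff lines.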

score 0 · Accepted Answer

I've done some simpler analysis of diff output before, so I had a Perl script that gave me a basis to start from. Consider the following two data files, file.1 and file.2.

file.1

Data

Foo
Bar 1
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 2
Bar 2

Etc.

file.2

Data

Foo
Bar 10
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 20
Bar 20

Etc.

Raw diff output

The raw unified diff output is:

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar 1
+Bar 10
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo 2
-Bar 2
+Foo 20
+Bar 20

 Etc.

Post-processed output

Now, after post-processing, all the digit strings have been replaced with ##, so the post-processed file looks like this:

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar ##
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo ##
-Bar ##
+Foo ##
+Bar ##

 Etc.

This is the input to the program that analyzes whether the differences are still real.
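The answer does not show how the digits were replaced. A minimal Python sketch of one way to do it is given here; the function name `mask_numbers` is made up, and skipping the `---`, `+++` and `@@` lines is an assumption based on the sample output, where those lines keep their numbers.

```python
import re


def mask_numbers(diff_lines):
    """Replace every run of digits with ## in body lines, leaving header lines alone."""
    masked = []
    for line in diff_lines:
        if line.startswith(("--- ", "+++ ", "@@ ")):
            # File header and @@ lines keep their timestamps and line ranges.
            masked.append(line)
        else:
            masked.append(re.sub(r"\d+", "##", line))
    return masked
```

A removed line whose text begins with `-- ` would be mistaken for a file header by this sketch, so a stricter version would track its position in the diff rather than rely on prefixes.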

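To be really useful, we have to isolate the header lines (--- and +++) and keep them. For each block of diffs starting with @@, we need to capture the adjacent - and + lines and:

Check whether the number of - lines and + lines is the same.
Check whether the content of the - lines is the same as the content of the + lines.
Remember that, although the sample data doesn't show it, you can have multiple blocks of - and + lines within a single @@ section.
If there are no differences left in the @@ block, the whole block can be discarded.
If there are differences, then we need to output the header lines if they have not been output before.
If there are differences, output the whole block of diffs.

Rinse and repeat.

My programming language of choice for this is Perl.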
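The steps above can be sketched compactly in Python. This is only an illustration under stated assumptions, not a port of the script in this answer: the names `hunk_differs` and `filter_diff` are made up, the input is assumed to be a unified diff of a single pair of files already split into lines without trailing newlines, and any body line that does not start with - or + is treated as context.

```python
def hunk_differs(body):
    """Return True if any run of -/+ lines in one @@ block still differs."""
    minus, plus = [], []
    for line in body:
        if line.startswith("-"):
            minus.append(line[1:])
        elif line.startswith("+"):
            plus.append(line[1:])
        else:
            # A context line ends the current run of -/+ lines.
            if minus != plus:
                return True
            minus, plus = [], []
    # Comparing the lists checks both the line counts and the contents.
    return minus != plus


def filter_diff(lines):
    """Keep the file header plus only those @@ blocks that still differ."""
    header, hunks = [], []
    for line in lines:
        if line.startswith("@@ "):
            hunks.append([line])        # start a new @@ block
        elif hunks:
            hunks[-1].append(line)      # body line of the current block
        else:
            header.append(line)         # ---/+++ lines before the first block
    kept = [hunk for hunk in hunks if hunk_differs(hunk[1:])]
    if not kept:
        return []                       # nothing real left, so no header either
    return header + [line for hunk in kept for line in hunk]
```

An empty result from `filter_diff` means every remaining -/+ run was an equal pair, which corresponds to "the diff is empty" in the question.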

checkdiffs.pl

#!/usr/bin/env perl
use strict;
use warnings;
use constant debug => 0;

my $file1;
my $file2;
my $header = 0;

OUTER:
while (my $line = <>)
{
    chomp $line;
    print "[$line]\n" if debug;
    if ($line =~ m/^--- /)
    {
        $file1 = $line;
        $file2 = <>;
        chomp $file2;
        print "[$file2]\n" if debug;
        if ($file2 !~ m/^\+\+\+ /)
        {
            print STDERR "Unexpected file identification lines\n";
            print STDERR "$file1\n";
            print STDERR "$file2\n";
            next OUTER;
        }
        $header = 0;    # Have not output file header yet

        my @lines;
        my $atline;

        last OUTER unless defined($line = <>);
INNER:
        while ($line =~ m/^@@ /)
        {
            chomp $line;
            print "@[$line]\n" if debug;
            $atline = $line;
            @lines  = ();

            while (defined($line = <>) && $line =~ m/^[- +]/)
            {
                chomp $line;
                print ":[$line]\n" if debug;
                push @lines, $line;
            }
            # Got a complete @@ block of diffs
            post_process($atline, @lines);

            last OUTER if !defined($line);
            next INNER if ($line =~ m/^@@ /);
            print STDERR "Unexpected input line: [$line]\n";
            last OUTER;
        }
    }
}

sub differences
{
    my($pref, $mref) = @_;
    my $pnum = scalar(@$pref);
    my $mnum = scalar(@$mref);
    print "-->> differences\n" if debug;
    return 0 if ($pnum == 0 && $mnum == 0);
    return 1 if ($pnum != $mnum);
    foreach my $i (0..($pnum-1))
    {
        my $pline = substr(${$pref}[$i], 1);
        my $mline = substr(${$mref}[$i], 1);
        return 1 if ($pline ne $mline);
    }
    print "<<-- differences\n" if debug;
    return 0;
}

sub post_process
{
    my($atline, @lines) = @_;

    print "-->> post_process\n" if debug;
    # Work out whether there are any differences left
    my @plines = ();    # +lines
    my @mlines = ();    # -lines
    my $diffs  = 0;
    my $ptype  = ' ';   # Previous line type

    foreach my $line (@lines)
    {
        print "---- $line\n" if debug;
        my ($ctype) = ($line =~ m/^(.)/);
        if ($ctype eq ' ')
        {
            if (($ptype eq '-' || $ptype eq '+') && differences(\@plines, \@mlines))
            {
                $diffs = 1;
                last;
            }
            @plines = ();
            @mlines = ();
        }
        elsif ($ctype eq '-')
        {
            push @mlines, $line;
        }
        elsif ($ctype eq '+')
        {
            push @plines, $line;
        }
        else
        {
            print STDERR "Unexpected input line format: $line\n";
            exit 1;
        }
        $ptype = $ctype;
    }

    $diffs = 1 if differences(\@plines, \@mlines);

    if ($diffs != 0)
    {
        # Print the block of differences, preceded by file header if necessary
        if ($header == 0)
        {
            print "$file1\n";
            print "$file2\n";
            $header = 1;
        }
        print "$atline\n";
        foreach my $line (@lines)
        {
            print "$line\n";
        }
    }

    print "<<-- post_process\n" if debug;
    return;
}

Tested using the file data (the post-processed output shown above) and three variants of it:

$ perl checkdiffs.pl data
$ perl checkdiffs.pl data.0
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar #0
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
$ perl checkdiffs.pl data.1
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo #0
-Bar ##
+Foo ##
+Bar ##

 Etc.
$ perl checkdiffs.pl data.2
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar #0
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo ##
-Bar #0
+Foo ##
+Bar ##

 Etc.
$

Does this meet your requirements?

score 0 · Accepted Answer

I think this might work (unless you have duplicate pairs):

   sed 's/^[-+]//' filename | perl -ne 'print unless $seen{$_}++'

This replaces the leading +/- with an empty string, then keeps only the unique lines.
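To make the behaviour of that pipeline easier to see, here is a rough Python equivalent. The function name `strip_and_dedupe` is made up, and the sketch assumes the input is already split into lines.

```python
import re


def strip_and_dedupe(lines):
    """Drop a leading + or - from each line, then keep only first occurrences."""
    seen = set()
    result = []
    for line in lines:
        stripped = re.sub(r"^[-+]", "", line)
        if stripped not in seen:
            seen.add(stripped)
            result.append(stripped)
    return result
```

For the lines ` Foo`, `-Bar`, `+Bar`, ` Baz` this yields ` Foo`, `Bar`, ` Baz`. In this sketch a matched pair still leaves one copy behind and the +/- markers are gone, so the result is a deduplicated listing rather than a diff that can be tested for emptiness.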

score 0 · Accepted Answer

You could use the s modifier and a positive lookahead:

With the s modifier, the dot also matches newlines.
With a positive lookahead, you can find occurrences of the match without making them part of the match (which skips everything in between...).

Here is a sample match at regexpal.

Here is a C# regex sample that should be close to what you need:

var sourceString = @"-Foo
    +Foo
    la
    -Bar
    +Foo
    la
    -Ko
    +Bar
    la
    +Ko
    -Ena
    asdsda
    -Dva
    +Ena
    +Dva
    ";
Regex ItemRegex = new Regex(@"(?s)\-(.*?)\n(?=(.*?)(\+\1))", RegexOptions.Compiled);
foreach (Match ItemMatch in ItemRegex.Matches(sourceString))
{
    Console.WriteLine(ItemMatch);
}
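The same idea can be sketched in Python, where `re.DOTALL` plays the role of the s modifier. The function name `removed_lines_with_match` is made up for this example, and the pattern is a slightly simplified form of the one above (the extra capture groups inside the lookahead are dropped).

```python
import re

# '-' followed by some text and a newline, provided that '+' plus the same text
# appears somewhere later; the lookahead does not consume that later text.
PAIR = re.compile(r"-(.*?)\n(?=.*?\+\1)", re.DOTALL)


def removed_lines_with_match(diff_text):
    """Return the text of each '-' line for which a later '+' line repeats it."""
    return [match.group(1) for match in PAIR.finditer(diff_text)]
```

Two caveats apply to this sketch: the pattern is not anchored to line starts, and `\+\1` only requires the later line to begin with the same text, so `-Foo` would also pair with `+Foobar`. Counting lines and comparing them explicitly avoids both problems.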
