I have a diff that I am post-processing, and I want to squash lines that are equal. Here is an example:

Foo
-Bar
+Bar
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n
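
As a quick check of that pattern, here is a minimal Python sketch. The question does not name a language, so Python is an assumption, and the `^` anchor plus `re.MULTILINE` are additions (not part of the pattern above) so that a `-` in the middle of a line cannot start a match. The name `squash_pairs` is hypothetical.

```python
import re

# Pattern from the question, anchored at line starts (the anchor is an assumption).
PAIR = re.compile(r"^-(.*)\n\+\1\n", re.MULTILINE)


def squash_pairs(diff_text: str) -> str:
    """Drop every '-X' line that is immediately followed by an identical '+X' line."""
    return PAIR.sub("", diff_text)


if __name__ == "__main__":
    # The single-line example from the question: the -Bar/+Bar pair disappears.
    print(squash_pairs("Foo\n-Bar\n+Bar\nBaz\n"), end="")
```

The pattern only pairs a `-` line with the `+` line directly after it, which is why it is limited to single-line changes.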

The problems start when I have multi-line matches like:

-Foo
-Bar
+Foo
+Bar

Any ideas? Or should I drop the regex and write a simple parser instead? Or does something like this already exist?

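Some backstory, in case there is a better solution: I am diffing two files to see whether they are the same. Sadly, the output is almost the same but needs some post-processing, for example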

-on line %d
+on line 8

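So I go through the diff and convert known strings into other known strings, and then I try to check whether the diff is empty or the files still differ.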

3 Answers

I've done some simpler analysis of diff output before, so I had a Perl script that gave me a basis to start from. Consider the following two data files, file.1 and file.2.

file.1

Data

Foo
Bar 1
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 2
Bar 2

Etc.

file.2

Data

Foo
Bar 10
Baz

I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with

-(.*)\n\+\1\n

The problems start when I have multi-line matches like:

Foo 20
Bar 20

Etc.

Raw diff output

The raw unified diff output is:

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar 1
+Bar 10
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo 2
-Bar 2
+Foo 20
+Bar 20

 Etc.

Post-processed output

Now, after post-processing, all the digit strings have been replaced by ##, so the post-processed file looks like this:

--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar ##
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo ##
-Bar ##
+Foo ##
+Bar ##

 Etc.

This is the input to the program that will analyze whether the differences are still real.
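
The answer does not show how the digit strings were masked. One way it could be done is sketched below in Python. Assumptions here: header lines are recognised by their `--- `, `+++ ` and `@@ ` prefixes and left alone (as in the sample output above), and the name `mask_numbers` is hypothetical.

```python
import re

DIGITS = re.compile(r"\d+")
# Lines with these prefixes are treated as diff headers and kept verbatim (assumption).
HEADER_PREFIXES = ("--- ", "+++ ", "@@ ")


def mask_numbers(diff_text: str) -> str:
    """Replace each run of digits on non-header lines with '##'."""
    masked = []
    for line in diff_text.splitlines(keepends=True):
        if line.startswith(HEADER_PREFIXES):
            masked.append(line)
        else:
            masked.append(DIGITS.sub("##", line))
    return "".join(masked)


if __name__ == "__main__":
    sample = "@@ -1,7 +1,7 @@\n Foo\n-Bar 1\n+Bar 10\n Baz\n"
    print(mask_numbers(sample), end="")
```

A removed line whose own text begins with `-- ` would look like a header to this prefix test, so a stricter parser may be needed for such input.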

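To be really useful, we have to isolate the header lines (--- and +++) and keep them. For each block of differences starting with @@, we need to capture the adjacent - and + lines, and:

  1. Check that the number of + lines and - lines is the same.
  2. Check that the contents of the - lines are the same as the contents of the + lines.
  3. Remember that, although the sample data doesn't show it, you can have multiple blocks of - and + lines within a single @@ section.
  4. If there are no differences in the @@ block, the whole block can be discarded.
  5. If there are differences, we need to output the header lines if they have not been output before.
  6. If there are differences, output the whole block of differences.

Rinse and repeat.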
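
Steps 1 to 3 are the core of the check. As a compact illustration, separate from the Perl script the answer actually uses, they might look like this in Python. The function names are hypothetical, and unlike the Perl script this sketch treats any line that does not start with `-` or `+` as context rather than reporting an error.

```python
def _run_matches(minus, plus):
    """True when the - and + runs have the same count and the same text after the marker."""
    return [m[1:] for m in minus] == [p[1:] for p in plus]


def hunk_has_real_differences(hunk_lines):
    """Return True if any run of adjacent -/+ lines inside one @@ block still differs."""
    minus, plus = [], []
    for line in hunk_lines:
        marker = line[:1]
        if marker == "-":
            minus.append(line)
        elif marker == "+":
            plus.append(line)
        else:
            # A context line closes the current run of -/+ lines.
            if not _run_matches(minus, plus):
                return True
            minus, plus = [], []
    # Check the run that reaches the end of the block.
    return not _run_matches(minus, plus)


if __name__ == "__main__":
    block = ["-Foo ##", "-Bar ##", "+Foo ##", "+Bar ##", " Etc."]
    print(hunk_has_real_differences(block))
```

A block for which this returns False can be discarded (step 4); otherwise the headers and the whole block are printed (steps 5 and 6).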

My programming language of choice for this is Perl.

checkdiffs.pl

#!/usr/bin/env perl
use strict;
use warnings;
use constant debug => 0;

my $file1;
my $file2;
my $header = 0;

OUTER:
while (my $line = <>)
{
    chomp $line;
    print "[$line]\n" if debug;
    if ($line =~ m/^--- /)
    {
        $file1 = $line;
        $file2 = <>;
        chomp $file2;
        print "[$file2]\n" if debug;
        if ($file2 !~ m/^\+\+\+ /)
        {
            print STDERR "Unexpected file identification lines\n";
            print STDERR "$file1\n";
            print STDERR "$file2\n";
            next OUTER;
        }
        $header = 0;    # Have not output file header yet

        my @lines;
        my $atline;

        last OUTER unless defined($line = <>);
INNER:
        while ($line =~ m/^@@ /)
        {
            chomp $line;
            print "@[$line]\n" if debug;
            $atline = $line;
            @lines  = ();

            while (defined($line = <>) && $line =~ m/^[- +]/)
            {
                chomp $line;
                print ":[$line]\n" if debug;
                push @lines, $line;
            }
            # Got a complete @@ block of diffs
            post_process($atline, @lines);

            last OUTER if !defined($line);
            next INNER if ($line =~ m/^@@ /);
            print STDERR "Unexpected input line: [$line]\n";
            last OUTER;
        }
    }
}

sub differences
{
    my($pref, $mref) = @_;
    my $pnum = scalar(@$pref);
    my $mnum = scalar(@$mref);
    print "-->> differences\n" if debug;
    return 0 if ($pnum == 0 && $mnum == 0);
    return 1 if ($pnum != $mnum);
    foreach my $i (0..($pnum-1))
    {
        my $pline = substr(${$pref}[$i], 1);
        my $mline = substr(${$mref}[$i], 1);
        return 1 if ($pline ne $mline);
    }
    print "<<-- differences\n" if debug;
    return 0;
}

sub post_process
{
    my($atline, @lines) = @_;

    print "-->> post_process\n" if debug;
    # Work out whether there are any differences left
    my @plines = ();    # +lines
    my @mlines = ();    # -lines
    my $diffs  = 0;
    my $ptype  = ' ';   # Previous line type

    foreach my $line (@lines)
    {
        print "---- $line\n" if debug;
        my ($ctype) = ($line =~ m/^(.)/);
        if ($ctype eq ' ')
        {
            if (($ptype eq '-' || $ptype eq '+') && differences(\@plines, \@mlines))
            {
                $diffs = 1;
                last;
            }
            @plines = ();
            @mlines = ();
        }
        elsif ($ctype eq '-')
        {
            push @mlines, $line;
        }
        elsif ($ctype eq '+')
        {
            push @plines, $line;
        }
        else
        {
            print STDERR "Unexpected input line format: $line\n";
            exit 1;
        }
        $ptype = $ctype;
    }

    $diffs = 1 if differences(\@plines, \@mlines);

    if ($diffs != 0)
    {
        # Print the block of differences, preceded by file header if necessary
        if ($header == 0)
        {
            print "$file1\n";
            print "$file2\n";
            $header = 1;
        }
        print "$atline\n";
        foreach my $line (@lines)
        {
            print "$line\n";
        }
    }

    print "<<-- post_process\n" if debug;
    return;
}

This was tested with the file data and three variants of it:

$ perl checkdiffs.pl data
$ perl checkdiffs.pl data.0
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar #0
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
$ perl checkdiffs.pl data.1
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo #0
-Bar ##
+Foo ##
+Bar ##

 Etc.
$ perl checkdiffs.pl data.2
--- file.1  2013-03-30 18:58:35.000000000 -0700
+++ file.2  2013-03-30 18:58:48.000000000 -0700
@@ -1,7 +1,7 @@
 Data

 Foo
-Bar #0
+Bar ##
 Baz

 I want to squash the lines down that are equal so they don't show up in the diff anymore. This is pretty simple with
@@ -10,7 +10,7 @@

 The problems start when I have multi-line matches like:

-Foo ##
-Bar #0
+Foo ##
+Bar ##

 Etc.
$ 

Does this do what you are after?

answered 2013-03-31T03:52:43.713

I think this might work (unless you have repeated pairs):

   sed 's/^[-+]//' filename | perl -ne 'print unless $seen{$_}++'

This replaces the leading +/- with an empty string, then keeps only the unique lines.
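
To make the behaviour of that pipeline easier to inspect, here is a rough Python rendering of the same two steps. It is a sketch of what the sed and perl commands do, not a replacement for them, and `strip_and_dedupe` is a hypothetical name.

```python
def strip_and_dedupe(lines):
    """Strip one leading '-' or '+' from each line, then keep only first occurrences."""
    seen = set()
    result = []
    for line in lines:
        if line[:1] in ("-", "+"):
            line = line[1:]          # same effect as sed 's/^[-+]//'
        if line not in seen:         # same effect as perl's 'print unless $seen{$_}++'
            seen.add(line)
            result.append(line)
    return result


if __name__ == "__main__":
    for text in strip_and_dedupe(["-Foo ##", "-Bar ##", "+Foo ##", "+Bar ##"]):
        print(text)
```

Note that a matching -/+ pair is collapsed to a single line rather than removed, and any line that repeats elsewhere in the input is dropped too, so the result still has to be compared against something to decide whether real differences remain.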

answered 2013-03-31T04:35:29.927

You could use the s modifier and a positive lookahead:

  • with the s modifier, the dot also matches newlines
  • with a positive lookahead, you can find occurrences of the match without making them part of the match (which skips everything in between...).

Here is a sample match at regexpal.

Here is a C# regex sample that should be close to what you need:

using System;
using System.Text.RegularExpressions;

var sourceString = @"-Foo
    +Foo
    la
    -Bar
    +Foo
    la
    -Ko
    +Bar
    la
    +Ko
    -Ena
    asdsda
    -Dva
    +Ena
    +Dva
    ";
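// (?s) lets '.' match newlines; the lookahead requires a later "+<same text>" without consuming it.
Regex ItemRegex = new Regex(@"(?s)\-(.*?)\n(?=(.*?)(\+\1))", RegexOptions.Compiled);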
foreach (Match ItemMatch in ItemRegex.Matches(sourceString))
{
    Console.WriteLine(ItemMatch);
}
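
The same pattern can be tried in Python, whose `re` module also supports the inline `(?s)` flag and lookaheads. This is only a sketch for experimenting with the idea; the function name is hypothetical.

```python
import re

# Same pattern as the C# sample: a '-' line whose text reappears later after a '+'.
PAIRED_MINUS = re.compile(r"(?s)-(.*?)\n(?=(.*?)(\+\1))")


def removed_lines_with_later_addition(diff_text):
    """Return the text of each '-' line for which a later '+<same text>' exists."""
    return [m.group(1) for m in PAIRED_MINUS.finditer(diff_text)]


if __name__ == "__main__":
    print(removed_lines_with_later_addition("-Foo\n-Bar\n+Foo\n+Bar\n"))
```

Because the backreference is not anchored to the end of a line, `-Foo` is also considered paired with a later `+Foobar`, so the pattern probably needs tightening before it is used on real diffs.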
answered 2013-03-31T13:08:22.100