I have a file that looks like this:

SECTION1 id name  
 sub section1
 sub section2
SECTION2 id name  
 sub section3
 sub section4
 sub section6
SECTION1 id name  
 sub section7
 sub section8
SECTION3 id name  
 sub section9
 sub section10
 sub section11
 sub section12
SECTION2 id name  
 sub section13
 sub section14
SECTION1 id name  
 sub section15
 sub section16
SECTION3 id name  
 sub section17
 sub section18

I need to sort this file section-wise. The only thing I know is that the section names are "SECTION1", "SECTION2", and "SECTION3". The expected output after sorting is:

SECTION1 id name  
 sub section1
 sub section2
SECTION1 id name  
 sub section7
 sub section8
SECTION1 id name  
 sub section15
 sub section16
SECTION2 id name  
 sub section3
 sub section4
 sub section6
SECTION2 id name  
 sub section13
 sub section14
SECTION3 id name  
 sub section9
 sub section10
 sub section11
 sub section12
SECTION3 id name  
 sub section17
 sub section18

Is there a simple way to do this in Perl, or with utilities such as grep and sed?

5 Answers

3

Another way, using perl.

Assuming infile contains the data from the question and script.pl contains the following:

use warnings;
use strict;
use sort qw/stable/;

my ($section, @section);

while ( <> ) { 

    ## Save text if first line or when line doesn't begin with 'SECTION' word.
    if ( $. == 1 || $_ !~ m/\ASECTION\d+/ ) { 
        $section .= $_; 
        next unless eof;
    }   

    ## Save the text and the number of section.
    if ( $section =~ m/\ASECTION(\d+)/ ) { 
        push @section, [ $1, $section ];
        $section = q||;
    }   

    ## Begin to save next section.
    $section .= $_; 
}

## Print them sorted by section number.
for ( sort { $a->[0] <=> $b->[0] } @section ) { 
    printf qq|%s|, $_->[1];
}

Run it like this:

perl script.pl infile

It produces the following output:

SECTION1 id name  
 sub section1
 sub section2
SECTION1 id name  
 sub section7
 sub section8
SECTION1 id name  
 sub section15
 sub section16
SECTION2 id name  
 sub section3
 sub section4
 sub section6
SECTION2 id name  
 sub section13
 sub section14
SECTION3 id name  
 sub section9
 sub section10
 sub section11
 sub section12
SECTION3 id name  
 sub section17
 sub section18
answered 2012-06-24T17:43:09.057
3

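This looks like something that needs a custom sort. Perl's default sort compares strings, so it does not order strings containing numbers numerically; we therefore need to extract the numbers before sorting. For large data sets, I optimized it with a Schwartzian transform.

The gist is to extract the section number and then the subsection number, sort by section number first, and break ties by subsection number. Only the first number in the subsections is considered, so the lines within each section are assumed to be already sorted.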
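
As a minimal sketch of why the numbers have to be extracted first, the snippet below sorts a few labels once as strings and once with a Schwartzian transform on the embedded number. The labels are hypothetical sample data, not lines taken from the file in the question.

```perl
use strict;
use warnings;

# Hypothetical sample labels, not taken from the question's file.
my @labels = ('section10', 'section9', 'section2');

# Default sort compares strings, so "section10" comes before "section2".
my @by_string = sort @labels;

# Schwartzian transform: pair each label with its number, sort on the
# number, then unwrap the original label.
my @by_number = map  { $_->[0] }
                sort { $a->[1] <=> $b->[1] }
                map  { [ $_, /(\d+)/ ] } @labels;

print "string:  @by_string\n";
print "numeric: @by_number\n";
```

The string sort yields section10, section2, section9, while the numeric version yields section2, section9, section10.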

To use it on a file, just change <DATA> to <>, and run:

perl script.pl inputfile > outputfile

Code:

use strict;
use warnings;

local $/;           # read entire file
my $data = <DATA>;  # slurp input file into scalar
my @records = split /(?=^SECTION)/m, $data;  # split into records
my @sorted =    map  {  $_->[0] }
                sort {  $a->[1] <=> $b->[1] ||
                        $a->[2] <=> $b->[2] }  
                map   { getnum($_) } @records;   # Schwartzian transform sort
print @sorted;

sub getnum {    # extract section and subsection numbers
    my ($sec) = $_[0] =~ /SECTION(\d+)/;
    my ($sub) = $_[0] =~ /\n.*?(\d+)/;
    return [ $_[0], $sec, $sub ];    # return anonymous array
}

__DATA__
SECTION1 id name  
 sub section1
 sub section2
SECTION2 id name  
 sub section3
 sub section4
 sub section6
SECTION1 id name  
 sub section7
 sub section8
SECTION3 id name  
 sub section9
 sub section10
 sub section11
 sub section12
SECTION2 id name  
 sub section13
 sub section14
SECTION1 id name  
 sub section15
 sub section16
SECTION3 id name  
 sub section17
 sub section18
answered 2012-06-24T18:29:38.333
1
#!/usr/bin/perl
use strict;
use warnings;

my @data;
{   # limit change to $/ to this scope
    local $/ = "SECTION";
    @data = map {chomp; $_ || ()} <DATA>;   
}

{   # limit change to 'warnings' to this scope
    no warnings 'numeric';
    print "SECTION$_" for sort {$a <=> $b} @data;
}

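This keeps each section intact.

Or from the command line (the double quotes as written suit Windows cmd.exe; in a POSIX shell, use single quotes instead so the shell does not expand $_, $a, and $b):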

perl -F/SECTION/ -0ane "print qq{SECTION$_} for grep $_, sort {$a <=> $b} @F" o33.txt
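
Both versions rely on Perl reading the leading digits of each chunk as its number when the chunks are compared with <=>. A minimal sketch of that behaviour, using two hypothetical chunks shaped like the pieces left after splitting on "SECTION":

```perl
use strict;
use warnings;

# Hypothetical records, shaped like the chunks left after splitting on "SECTION".
my @chunks = ("2 id name\n sub section3\n", "1 id name\n sub section1\n");

my @sorted;
{
    # Each chunk is not purely numeric, so <=> would warn without this.
    no warnings 'numeric';

    # In numeric context "2 id name..." counts as 2 and "1 id name..." as 1.
    @sorted = sort { $a <=> $b } @chunks;
}

print "SECTION$_" for @sorted;
```

This prints the SECTION1 record followed by the SECTION2 record.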
answered 2012-06-24T17:46:54.537
1

This might work for you (GNU sed):

sed ':a;$!N;/\nSECTION/!s/\n/\x00/;ta;s/n\([0-9][\x00\n]\|$\)/n0\1/g;P;D' file |
sort |
sed 's/\x00/\n/g;s/n0/n/g'

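Explanation:

  • Join each SECTION and its sub sections into a single line: :a;$!N;/\nSECTION/!s/\n/\x00/;ta
  • Pad single-digit sub section numbers with a leading 0, so that the plain lexical sort orders them correctly: s/n\([0-9][\x00\n]\|$\)/n0\1/g
  • Print the first line of the pattern space, then delete it: P;D
  • Sort the piped output: sort
  • Split the sorted output back into separate lines and remove the padding: sed 's/\x00/\n/g;s/n0/n/g'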
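
The same join, sort, split idea can be sketched in Perl. This is an illustration under assumptions rather than a translation of the sed script: the input lines are hypothetical sample data, and the zero-padding step is left out because every number in the sample has a single digit.

```perl
use strict;
use warnings;

# Hypothetical input lines, shaped like the question's file.
my @lines = (
    "SECTION2 id name\n", " sub section3\n",
    "SECTION1 id name\n", " sub section1\n", " sub section2\n",
);

# Join: fold each sub section line into its SECTION line, using a NUL byte
# in place of the newline so that every record is a single line.
my @records;
for my $line (@lines) {
    if ($line =~ /^SECTION/ or !@records) {
        push @records, $line;
    }
    else {
        $records[-1] =~ s/\n\z/\0/;
        $records[-1] .= $line;
    }
}

# Sort the one-line records; a plain string sort is enough for this sample
# because all the section numbers have a single digit.
my @sorted = sort @records;

# Split: turn the NUL bytes back into newlines.
s/\0/\n/g for @sorted;

print @sorted;
```

The NUL byte is a convenient joiner because it does not occur in ordinary text, so turning it back into newlines afterwards is unambiguous.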
answered 2012-06-24T17:55:54.090
1

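This can be done very simply by accumulating the records in three separate lists according to the section label.

The program uses a hash to do this, and builds complete sections by appending each line of the file to the most recent record. If the line is the start of a new section, another empty record is added to the list before the line is appended.

Displaying the result is just a matter of printing all the elements of the lists in order of section label.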

use strict;
use warnings;

open my $fh, '<', 'sections.txt' or die $!;

my %sections;
my $current_list;

while (<$fh>) {
  if (/^(SECTION[123])/) {
    $current_list = $sections{$1} //= [];
    push @$current_list, '';
  }
  $current_list->[-1] .= $_ if $current_list;
}

for my $name (sort keys %sections) {
  print for @{ $sections{$name} };
}

Output

SECTION1 id name  
 sub section1
 sub section2
SECTION1 id name  
 sub section7
 sub section8
SECTION1 id name  
 sub section15
 sub section16
SECTION2 id name  
 sub section3
 sub section4
 sub section6
SECTION2 id name  
 sub section13
 sub section14
SECTION3 id name  
 sub section9
 sub section10
 sub section11
 sub section12
SECTION3 id name  
 sub section17
 sub section18
answered 2012-06-24T19:52:56.143