I have a large data dump whose structure looks like this:

Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all

I would like to convert it to:

Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all

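That is:

  • Build the set of all keys
  • Generate a header row containing all of the keys
  • Map every value to its correct "column" (note that in this example I have no "Key4", and that Key3/Key5 are interchanged)
  • Possibly in Perl, since it is easier to use across a variety of environments.

But I'm not sure whether this format is unusual, or whether a tool already exists that does this.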

3 Answers

2

This is easy to do with a hash and the Text::CSV_XS module:

use strict;
use warnings;

use Text::CSV_XS;

my @rows;
my %headers;

{
    local $/ = "";

    while (<DATA>) {
        chomp;
        my %record;

        for my $line (split(/\n/)) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1} = $2;
            $headers{$1} = $1;
        }

        push(@rows, \%record);
    }
}

unshift(@rows, \%headers);

my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));

for my $row_ref (@rows) {
    $csv->print_hr(*STDOUT, $row_ref);
}

__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all

Output:

Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
answered 2017-04-06T19:30:14.120
0

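If your CSV format is "complicated" (for example, the values contain commas and the like), then use one of the Text::CSV modules. But if it isn't, which is often the case, I tend to just use split and join.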
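To make that trade-off concrete, here is a minimal sketch of what goes wrong with a bare join when a value contains a comma, and of the kind of quoting a CSV module does for you. The helper name csv_field is hypothetical and only covers basic quoting (commas, double quotes, line breaks); for real data a Text::CSV module is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a single field only when it needs quoting.
# Embedded double quotes are doubled, as CSV readers generally expect.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;
        return qq{"$value"};
    }
    return $value;
}

my @values = ( 'plain', 'has, comma', 'says "hi"' );

# Naive join: the second value now looks like two columns.
my $naive = join ",", @values;

# Quoted join: still three columns when read back.
my $quoted = join ",", map { csv_field($_) } @values;

print "$naive\n$quoted\n";
```

The first printed line has four comma-separated pieces for three values, while the second keeps each value in its own field. If none of your values can contain a comma, quote, or line break, the plain join is enough.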

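What helps in your scenario is that you can map the key-value pairs of a record into a hash very easily with a regular expression, and then use a hash slice to output them: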

#!/usr/bin/env perl

use strict;
use warnings;

#set paragraph mode - records are blank line separated. 
local $/ = "";

my @rows;
my %seen_header;

#read STDIN or files on command line, just like sed/grep 
while ( <> ) {
   #multi - line pattern, that matches all the key-value pairs,
   #and then inserts them into a hash. 
   my %this_row = m/^(\w+):\.+ (.*)$/gm;
   push ( @rows, \%this_row ); 

   #add the keys we've seen to a hash, so we 'know' what we've seen. 
   $seen_header{$_}++ for keys %this_row; 
}

#extract the keys, make them unique and ordered. 
#could set this by hand if you prefer.    
my @header = sort keys %seen_header;

#print the header row
print join( ",", @header ), "\n";

#iterate the rows
foreach my $row ( @rows ) {
    #use a hash slice to select the values matching @header.
    #the map is so any undefined values (missing keys) don't report errors, they
    #just return blank fields. 
    print join( ",", map { $_ // '' } @{$row}{@header} ), "\n";
}

For your sample input, this produces:

Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all

If you want to be really clever, most of that initial loop can be done with:

my @rows = map { +{ m/^(\w+):\.+ (.*)$/gm } } <>;

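The catch is that you still need to build the headers array, which takes something a bit more complicated:

$seen_header{$_}++ for map { keys %$_ } @rows;

It works, but I don't think it is as clear what is going on.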

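However, the core of your problem may be file size. That is where you hit a snag, because you need to read the file twice: a first pass to find out which headers exist anywhere in the file, and a second pass to iterate and print: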

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $input, '<', 'your_file.txt') or die $!;
local $/ = "";

my %seen_header;
while ( <$input> ) { 
    $seen_header{$_}++ for m/^(\w+):/gm; 
}  

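my @header = sort keys %seen_header; 

#print the header row
print join( ",", @header ), "\n";

#return to the start of file:
seek ( $input, 0, 0 ); 

while ( <$input> )  {
   my %this_row = m/^(\w+):\.+ (.*)$/gm;
   #%this_row is a plain hash, so the slice is written @this_row{...}
   print join( ",", map { $_ // '' } @this_row{@header} ), "\n";
}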

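This will be slightly slower, because it has to read the file twice. But it uses far less memory, because it never holds the whole file in memory.

Unless you know all of your keys in advance and can define them up front, you will have to read the file twice.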
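If the keys are known up front, a single streaming pass is enough. The sketch below assumes the four keys from the sample data, hard-coded in @header, and reads from an in-memory string purely so that it runs on its own; row_to_csv is a hypothetical helper name, and the values are assumed not to need CSV quoting.

```perl
use strict;
use warnings;

# Assumption: the full set of keys is known in advance.
my @header = qw(Key1 Key2 Key3 Key5);

# Hypothetical helper: turn one blank-line-separated record into a CSV line.
sub row_to_csv {
    my ( $record, @columns ) = @_;
    # In list context the match returns key, value, key, value, ...
    my %row = $record =~ m/^(\w+):\.+ (.*)$/gm;
    # Hash slice in column order; missing keys become empty fields.
    return join ",", map { $_ // '' } @row{@columns};
}

my $sample = <<'END';
Key1:.............. Value
Key2:.............. Other value

Key1:.............. Different value
Key5:.............. Has no value at all
END

# Read from a string for the demo; a real script would open the dump file.
open( my $input, '<', \$sample ) or die $!;

print join( ",", @header ), "\n";
{
    local $/ = "";    # paragraph mode: one record per read
    while ( my $record = <$input> ) {
        # Each row is printed as soon as it is read, nothing is kept around.
        print row_to_csv( $record, @header ), "\n";
    }
}
close $input;
```

Because the header is fixed before any input is read, each record can be written out immediately, so neither a second pass nor an in-memory list of rows is needed. The cost is that any key missing from @header is silently dropped.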

answered 2017-04-07T08:46:38.380
-1

This seems to work with the data you have provided

use strict;
use warnings 'all';

my %data;

while ( <> ) {

    next unless /^(\w+):\W*(.*\S)/;

    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;

Output

{
  Key1 => ["Value", "Different value"],
  Key2 => ["Other value"],
  Key3 => ["Maybe another value yet", "Invaluable"],
  Key5 => ["Has no value at all"],
}
answered 2017-04-06T18:48:58.140