
I have many log files, for example cancel_log1, cancel_log2, and so on.

All of the files contain log entries like these:

2013/05/08 17:09:18 -0700 766 | 1368058158 | 22991 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^AMoney is tight. I would to keep the service but I don't have the money at this time. Maybe I can come back in the future.^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:45:35 -0700 219 | 1367973935 | 23388 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Aother^AYahoo China service close^Alifesig.com^AWeb Hosting^Akennethli2005^A05/10/2008^A05/07/2013    
2013/05/08 17:30:57 -0700 115 | 1368059457 | 22982 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^A^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:59:38 -0700 694 | 1367974778 | 23381 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf244baidu.com^ADomains^Achuanqisf244baidu^A05/07/2013^A05/07/2013    
2013/05/08 17:33:03 -0700 815 | 1368059583 | 23000 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013    
2013/05/07 17:59:40 -0700 231 | 1367974780 | 23389 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf239baidu.com^ADomains^Achuanqisf239baidu^A05/07/2013^A05/07/2013    

I want to extract the words separated by ^A and write them to a CSV file.

For example, my output file would look like this:

missing_feature chuanqisf239baidu.com Domains chuanqisf239baidu

Any help is appreciated.


3 Answers


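You can easily split on the delimiter and filter out the data after the ^A. I just picked the range of columns you indicated interest in, and added a bit of quoting logic before joining them with commas.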

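use feature 'say';    # say is not available unless this feature is enabled

while ( <> ) {
    # keep fields 1..5 and double-quote any field that contains a comma
    say join( ',', map { index( $_, ',' ) > -1 ? qq/"$_"/ : $_ } @{[ split /\^A/ ]}[1..5] );
}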

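To break that down into more steps:

  1. I use the "diamond operator" because, if extracting the data is the main problem, you don't need me to write file-handling code for you. I use it as a generic input loop.

  2. So we split like this: split /\^A/, which gives us a list.

  3. Then we take a slice of that list by subscripting the expression. If you have an array @a, then @a[2..4] extracts just the elements you are interested in. In the same way, @{[ split /\^A/ ]} is an "array expression", and @{[ split /\^A/ ]}[1..5] is a slice of that array.

  4. But it is a list like any other, so we put it in a map expression, where we check whether the field contains a comma. If it does, we wrap it in double quotes (qq/"$_"/); if not, we return it as is.

  5. Then we simply use join to insert a comma between each field, and say prints the resulting string.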
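
The steps above can also be spelled out one at a time. This is only a sketch: the sub name fields_to_csv_line and the sample line are made up for illustration, and it assumes the separator is the literal two-character sequence ^A, as in the code above. If the logs really contain the Ctrl-A control character, the pattern would presumably be /\x01/ instead.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one log line into a comma-joined string.
sub fields_to_csv_line {
    my ($line) = @_;

    my @all    = split /\^A/, $line;    # step 2: split on the separator
    my @wanted = @all[ 1 .. 5 ];        # step 3: slice out fields 1..5
    my @quoted = map {                  # step 4: quote fields that hold a comma
        index( $_, ',' ) > -1 ? qq/"$_"/ : $_
    } @wanted;

    return join ',', @quoted;           # step 5: join with commas
}

# Made-up sample line in the same shape as the logs in the question.
my $sample = 'prefix Online^Aother^Aclosed, moved^Aexample.com^AWeb Hosting^Auser1^A05/10/2008^A05/07/2013';
print fields_to_csv_line($sample), "\n";
```

Naming each intermediate list makes it easier to change the slice range or the quoting rule later without touching the rest.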

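However, the above is a poor way to produce CSV, because it only does half the job. In real CSV output, if you quote a field, you also have to handle any quotes embedded in it.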
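
To make the embedded-quote problem concrete, here is a minimal hand-rolled sketch that follows the common CSV convention (the one described in RFC 4180) of doubling any double quote inside a quoted field. The sub name csv_field is hypothetical, and this is not meant as a replacement for a proper CSV module.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a single field in RFC 4180 style.
sub csv_field {
    my ($field) = @_;
    # Only fields containing a comma, a double quote or a line break need quoting.
    return $field unless $field =~ /[",\r\n]/;
    ( my $escaped = $field ) =~ s/"/""/g;    # double each embedded quote
    return qq/"$escaped"/;
}

print join( ',', map { csv_field($_) } 'plain', 'a,b', 'say "hi"' ), "\n";
# prints: plain,"a,b","say ""hi"""
```

In practice a CSV module takes care of these details for you.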

So with Text::CSV, that would be:

use Text::CSV;

my $csv 
    = Text::CSV->new ( 
        { binary      => 1
        , quote_space => 0 
        } ) 
   or die "Cannot use CSV: ".Text::CSV->error_diag ();

while ( <> ) { 
    $csv->print( \*STDOUT, [ @{[ split /\^A/ ]}[1..5] ] );
    print "\n";
}
answered 2013-05-10T12:30:58.303

This simple program seems to do what you need. It expects the name of the input file as a parameter on the command line.

use strict;
use warnings;

my $date = qr|^[0-9]{2}/[0-9]{2}/[0-9]{4}\s*$|;

while ( <> ) {
  my @fields = split /\^A/;
  shift @fields;    # drop the log prefix before the first ^A
  pop @fields while @fields and $fields[-1] =~ $date;    # drop trailing date fields
  print join(',', @fields), "\n";
}

If your fields ever contain commas, they need to be quoted, and you should replace the print line with this:

print join(',', map { /,/ ? '"'.s/"/\\"/gr . '"' : $_ } @fields), "\n";

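This quotes any field that contains a comma and backslash-escapes any double quotes that field may contain. (The /r modifier requires Perl 5.14 or later.)

Output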

too_expensive,Money is tight. I would to keep the service but I don't have the money at this time. Maybe I can come back in the future.,securesanctuary.org,Web Hosting,securesanctuary
other,Yahoo China service close,lifesig.com,Web Hosting,kennethli2005
too_expensive,,securesanctuary.org,Web Hosting,securesanctuary
missing_feature,,chuanqisf244baidu.com,Domains,chuanqisf244baidu
retired,,sisterthrifty.com,Domains,trinaboice
missing_feature,,chuanqisf239baidu.com,Domains,chuanqisf239baidu
answered 2013-05-10T14:08:22.497

Here's another option:

use strict;
use warnings;

my @words;
while (<>) {
    @words = /\^A(.+?)\^A/g and print +( join ',', @words ) . "\n";
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs the output to a file.

Output on the dataset:

too_expensive,securesanctuary.org,securesanctuary
other,lifesig.com,kennethli2005
too_expensive,securesanctuary.org,securesanctuary
missing_feature,chuanqisf244baidu.com,chuanqisf244baidu
retired,sisterthrifty.com,trinaboice
missing_feature,chuanqisf239baidu.com,chuanqisf239baidu

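The script uses a regex to globally capture the text between ^A separators on each line, then joins those captures with "," before printing the result.

The and acts as a short circuit, so the print happens only when words have been captured (no blank lines).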
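
One thing to watch for: each match consumes its closing ^A, so that ^A cannot also open the next match, and the pattern ends up capturing only every other field. That is why the comment and product columns are missing from the output above. A possible variation, sketched here with a hypothetical sub name and a made-up sample line, uses a lookahead so the closing separator is not consumed. Note that it then also keeps empty fields and the first of the two dates.

```perl
use strict;
use warnings;

# Hypothetical helper: capture every field that is followed by another ^A.
sub fields_between {
    my ($line) = @_;
    # (?=\^A) checks for the closing separator without consuming it,
    # so the same ^A can open the next match.
    return $line =~ /\^A(.*?)(?=\^A)/g;
}

# Made-up sample line in the same shape as the logs in the question.
my $sample = 'prefix Online^Aretired^A^Aexample.com^ADomains^Auser1^A08/09/2005^A05/08/2013';
print join( ',', fields_between($sample) ), "\n";
# prints: retired,,example.com,Domains,user1,08/09/2005
```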

Hope this helps!

answered 2013-05-10T14:02:03.550