perl - Need to extract words from a file before and after ^A in Perl

Question

I have many log files, for example cancel_log1, cancel_log2, ...

All of the files contain log entries like this

2013/05/08 17:09:18 -0700 766 | 1368058158 | 22991 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^AMoney is tight. I would to keep the service but I don&#39;t have the money at this time. Maybe I can come back in the future.^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:45:35 -0700 219 | 1367973935 | 23388 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Aother^AYahoo China service close^Alifesig.com^AWeb Hosting^Akennethli2005^A05/10/2008^A05/07/2013    
2013/05/08 17:30:57 -0700 115 | 1368059457 | 22982 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:bucket=A:cancelservice:cache Function: () line: 450 Online^Atoo_expensive^A^Asecuresanctuary.org^AWeb Hosting^Asecuresanctuary^A05/09/2009^A05/08/2013    
2013/05/07 17:59:38 -0700 694 | 1367974778 | 23381 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf244baidu.com^ADomains^Achuanqisf244baidu^A05/07/2013^A05/07/2013    
2013/05/08 17:33:03 -0700 815 | 1368059583 | 23000 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Aretired^A^Asisterthrifty.com^ADomains^Atrinaboice^A08/09/2005^A05/08/2013    
2013/05/07 17:59:40 -0700 231 | 1367974780 | 23389 | yapache | cancelfeedback | INFO | File: /home/y/share/UNI/sites/order/cache/views/root:order:cancelstep5:cache Function: () line: 436 Online^Amissing_feature^A^Achuanqisf239baidu.com^ADomains^Achuanqisf239baidu^A05/07/2013^A05/07/2013

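I want to extract the words separated by ^A and write them to a CSV file.

For example, my output file would look like this:

missing_feature chuanqisf239baidu.com Domains chuanqisf239baidu

Any help is appreciated.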

score 2 · Accepted Answer

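You can easily split on the field separator and filter out the data after the ^A. I just took the range of columns you said you were interested in and added a little quoting logic before joining them with commas.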

use feature 'say';    # say is not enabled by default

while ( <> ) {
    say join( ',', map { index( $_, ',' ) > -1 ? qq/"$_"/ : $_ } @{[ split /\^A/ ]}[1..5] );
}

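Broken down into more steps, it works as follows.

I use the "diamond operator" because, if extracting the data is the main problem, you don't need me to write the file-handling code for you. I use it as a generic input loop.
So we split like this: split /\^A/, which gives us a list.
Then we take a slice of that list by wrapping it in a slice expression. If you have an array @a, @a[2..4] extracts just the elements you are interested in. @{[ split /\^A/ ]} is likewise an "array expression", and @{[ split /\^A/ ]}[1..5] is a slice of that array.
But it is a list like any other, so we feed it to a map expression that checks whether the field contains a comma. If it does, we wrap it in double quotes ( qq/"$_"/ ); if not, we return it as is.
Then we simply use join to insert a comma between the fields, and say prints the resulting string.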
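
The steps above can be sketched one statement at a time. This is only an illustration: csv_row is a hypothetical helper name, and the sample line is made up and much shorter than the real log entries.

```perl
use strict;
use warnings;

# Turn one log line into a CSV row, one step per statement.
# csv_row is a hypothetical helper name, not part of the answer above.
sub csv_row {
    my ($line) = @_;

    my @fields = split /\^A/, $line;    # 1. split on the literal text "^A"
    my @wanted = @fields[ 1 .. 5 ];     # 2. slice: keep fields 1 through 5
    my @quoted = map {                  # 3. quote any field containing a comma
        index( $_, ',' ) > -1 ? qq/"$_"/ : $_
    } @wanted;

    return join ',', @quoted;           # 4. join the fields with commas
}

# Shortened sample line, made up for illustration.
my $line = 'prefix Online^Aother^AToo slow, too costly^Aexample.com^AWeb Hosting^Auser1^A01/02/2010^A03/04/2013';
print csv_row($line), "\n";
```

Naming the intermediate arrays costs a few lines but makes it easy to inspect each stage when a log line does not have the expected number of fields.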

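However, the above is a poor way to produce CSV; it only does half the job. In real CSV output, if you quote a field you must also handle any embedded quotes.

So with Text::CSV, that would be: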

use Text::CSV;

my $csv 
    = Text::CSV->new ( 
        { binary      => 1
        , quote_space => 0 
        } ) 
   or die "Cannot use CSV: ".Text::CSV->error_diag ();

while ( <> ) { 
    $csv->print( \*STDOUT, [ @{[ split /\^A/ ]}[1..5] ] );
    print "\n";
}
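
For comparison, the embedded-quote handling that a CSV module takes care of can be sketched by hand. The following is a minimal sketch, assuming the common convention of doubling embedded double quotes; csv_field is a hypothetical helper and the sample row is made up. It is not a substitute for Text::CSV.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one field the way CSV readers commonly expect.
sub csv_field {
    my ($field) = @_;
    return $field unless $field =~ /[",\r\n]/;    # plain fields pass through unchanged
    ( my $escaped = $field ) =~ s/"/""/g;         # double every embedded quote
    return qq/"$escaped"/;                        # then wrap the whole field in quotes
}

# Made-up sample row.
my @row = ( 'other', 'He said "no", twice', 'example.com' );
print join( ',', map { csv_field($_) } @row ), "\n";
```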

score 1 · Accepted Answer

This simple program seems to do what you need. It expects the name of the input file as a parameter on the command line.

use strict;
use warnings;

my $date = qr|^[0-9]{2}/[0-9]{2}/[0-9]{4}\s*$|;

while ( <> ) {
  my @fields = split /\^A/;
  shift @fields;
  pop @fields while @fields and $fields[-1] =~ $date;
  print join(',', @fields), "\n";
}

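If your fields might ever contain commas or double quotes, they need to be quoted, and you should replace the print line with this

print join(',', map { /[,"]/ ? '"' . s/"/""/gr . '"' : $_ } @fields), "\n";

which quotes the fields that contain a comma or a double quote and escapes embedded quotes by doubling them, as CSV readers generally expect. The /r modifier requires Perl 5.14 or later.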

Output

too_expensive,Money is tight. I would to keep the service but I don&#39;t have the money at this time. Maybe I can come back in the future.,securesanctuary.org,Web Hosting,securesanctuary
other,Yahoo China service close,lifesig.com,Web Hosting,kennethli2005
too_expensive,,securesanctuary.org,Web Hosting,securesanctuary
missing_feature,,chuanqisf244baidu.com,Domains,chuanqisf244baidu
retired,,sisterthrifty.com,Domains,trinaboice
missing_feature,,chuanqisf239baidu.com,Domains,chuanqisf239baidu
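
One thing worth checking in the real files: ^A is also how many editors and pagers display the single control character 0x01. The sketch below assumes the separator could be either that control character or the two literal characters "^" and "A", and accepts both. wanted_fields is a hypothetical helper and the sample lines are made up.

```perl
use strict;
use warnings;

# Assumption: the separator may be the single control character 0x01
# (displayed as ^A) rather than the two characters "^" and "A".
my $sep  = qr/\^A|\x01/;
my $date = qr|^[0-9]{2}/[0-9]{2}/[0-9]{4}\s*$|;

# Hypothetical helper mirroring the loop body of the program above.
sub wanted_fields {
    my ($line) = @_;
    my @fields = split $sep, $line;
    shift @fields;                                         # drop the log prefix
    pop @fields while @fields and $fields[-1] =~ $date;    # drop trailing dates
    return @fields;
}

# Made-up sample lines, one per separator style.
my @parts   = ( 'prefix Online', 'retired', '', 'example.com', 'Domains', 'user1', '08/09/2005', '05/08/2013' );
my @samples = map { join( $_, @parts ) . "\n" } '^A', "\x01";

print join( ',', wanted_fields($_) ), "\n" for @samples;
```

Running something like od -c on one of the log files would show which of the two forms is actually present.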

score 0 · Accepted Answer

Here is another option:

use strict;
use warnings;

my @words;
while (<>) {
    @words = /\^A(.+?)\^A/g and print +( join ',', @words ) . "\n";
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs the output to a file.

Output on the dataset:

too_expensive,securesanctuary.org,securesanctuary
other,lifesig.com,kennethli2005
too_expensive,securesanctuary.org,securesanctuary
missing_feature,chuanqisf244baidu.com,chuanqisf244baidu
retired,sisterthrifty.com,trinaboice
missing_feature,chuanqisf239baidu.com,chuanqisf239baidu

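The script uses a regex to globally capture the text between the ^As on each line, then joins those captures with "," before printing the result.

and is used as a short circuit, so the print only happens when words have been captured (no blank lines).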
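
Note that the pattern consumes the closing ^A of each match, so the next match has to start at a later ^A and every other field is skipped. That is why the product type (Web Hosting, Domains) is missing from the output above. The sketch below compares the original pattern with one possible variant that only peeks at the closing ^A through a lookahead. The variant is an illustration rather than part of the answer, and the sample line is made up.

```perl
use strict;
use warnings;

# Made-up sample line in the same shape as the logs above.
my $line = 'prefix Online^Aretired^A^Aexample.com^ADomains^Auser1^A08/09/2005^A05/08/2013';

# Original pattern: the closing ^A is consumed, so it cannot also
# open the next match and every other field is skipped.
my @alternate = $line =~ /\^A(.+?)\^A/g;

# Lookahead variant: the closing ^A is only peeked at, and the field body
# may not contain ^A itself, so empty fields are kept as empty strings.
my @every = $line =~ /\^A((?:(?!\^A).)*)(?=\^A)/g;

print join( ',', @alternate ), "\n";
print join( ',', @every ),     "\n";
```

The variant also returns the first of the two trailing dates, since that date sits between two ^A separators; the last field has no closing ^A and is never captured by either pattern.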

Hope this helps!
