regex - Perl: Using split but ignoring quotes

Question

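I'm trying to create a Perl hash from an input string, but I'm running into problems with a plain split because the values may contain quotes. Below is a sample input string and my (desired) resulting hash: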

my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,MOB,123,KEY,VALUE":TIME,"08:01:59":FIN,0';

my %hash = 
  (
   CREATE     => '',
   USER       => '',
   TEL        => '12345678',
   MOB        => '444001122',
   Type       => 'Whatever',
   ATTRIBUTES => 'ID,0,MOB,123,KEY,VALUE',
   TIME       => '08:01:59',
   FIN        => '0',
  );

The input string can be of arbitrary length, and the number of keys is not fixed.

Thanks!

-hq

score 5 · Accepted Answer

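~~Use Text::CSV. It handles comma-separated value files correctly.~~

Update

The standard module does not seem able to parse your input format, even with sep_char and allow_loose_quotes. So you have to do the heavy lifting yourself, but you can still use Text::CSV to parse each key-value pair: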

#!/usr/bin/perl
use warnings;
use strict;
use feature qw(say);

use Data::Dumper;

use Text::CSV;

my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0';

my @fields = split /:/, $command;
my %hash;
my $csv = Text::CSV->new();

my $i = 0;
while ($i <= $#fields) {
    if (1 == $fields[$i] =~ y/"//) {
        my $j = $i;
        $fields[$i] .= ':' . $fields[$j] until 1 == $fields[++$j] =~ y/"//;
        $fields[$i] .= ':' . $fields[$j];
        splice @fields, $i + 1, $j - $i, ();
    }
    $csv->parse($fields[$i]);
    my ($key, $value) = $csv->fields;
    $hash{$key} = $value // ''; # undef (key without a value) becomes q(), without an "uninitialized" warning
    $i++;
}

print Dumper \%hash;

score 3 · Accepted Answer

As far as I can see, the most obvious candidate, Text::CSV, cannot handle this format correctly, so the only option is a home-grown regex solution.

use strict;
use warnings;

my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0';

my %config;
for my $field ($command =~ /(?:"[^"]*"|[^:])+/g) {
  my ($key, $val) = split /,/, $field, 2;
  ($config{$key} = $val // '') =~ s/"([^"]*)"/$1/;
}

use Data::Dumper;
print Data::Dumper->Dump([\%config], ['*config']);

Output

%config = (
            'TIME' => '08:01:59',
            'MOB' => '444001122',
            'Type' => 'Whatever',
            'CREATE' => '',
            'TEL' => '12345678',
            'ATTRIBUTES' => 'ID,0,KEY,VALUE',
            'USER' => '',
            'FIN' => '0'
          );

If you have Perl v5.10 or later, you also have the handy (?| ... ) regex group (branch reset), which lets you write this:

use 5.010;
use warnings;

my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0';

my %config = $command =~ /(\w+) (?| , " ([^"]*) " | , ([^:"]*) | () )/gx;

use Data::Dumper;
print Data::Dumper->Dump([\%config], ['*config']);

This produces the same result as the code above.
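To illustrate what the branch-reset group is doing there: inside (?| ... ) every alternative numbers its captures from the same starting point, so whichever branch matches, the value ends up in the second capture. Here is a minimal sketch applied to a single field at a time, assuming Perl 5.10 or later; the sub name parse_pair is made up for this illustration and is not part of the answer above.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: parse one field of the form KEY, KEY,value or KEY,"quoted value".
sub parse_pair {
    my ($field) = @_;
    # All three alternatives inside (?| ... ) share capture group number 2:
    #   , "..."   quoted value (may contain ':' and ',')
    #   , ...     unquoted value
    #   ()        no value at all, captured as the empty string
    my ($key, $value) =
        $field =~ /\A (\w+) (?| , " ([^"]*) " | , ([^:"]*) | () ) \z/x
        or return;    # empty list if the field does not match
    return ($key, $value);
}

for my $field ('TIME,"08:01:59"', 'FIN,0', 'CREATE') {
    my ($key, $value) = parse_pair($field);
    say "$key => '$value'";
}
```

Without the branch reset, the three alternatives would fill captures 2, 3 and 4 respectively, and the /g match in list context would no longer return clean key/value pairs.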

score 2 · Accepted Answer

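This looks like something Text::ParseWords can handle. The quotewords subroutine splits the input on the delimiter :, ignoring delimiters inside quotes. That gives us the basic list of items, which appears first in the output as $VAR1. After that, it is a simple matter of parsing the comma-separated items with a regex that allows an optional second capture, to accommodate tags with no value such as CREATE and USER.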

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

while (<DATA>) {
    chomp;
    my @list = quotewords(':', 0, $_);
    my %hash = map { my ($k, $v) = /([^,]+),?(.*)/; $k => $v; } @list;
    print Dumper \@list, \%hash;
}

__DATA__
CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0

Output:

$VAR1 = [
          'CREATE',
          'USER',
          'TEL,12345678',
          'MOB,444001122',
          'Type,Whatever',
          'ATTRIBUTES,ID,0,KEY,VALUE',
          'TIME,08:01:59',
          'FIN,0'
        ];
$VAR2 = {
          'TIME' => '08:01:59',
          'MOB' => '444001122',
          'Type' => 'Whatever',
          'CREATE' => '',
          'TEL' => '12345678',
          'ATTRIBUTES' => 'ID,0,KEY,VALUE',
          'USER' => '',
          'FIN' => '0'
        };
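The same idea can be wrapped in a reusable sub and run against the exact input string from the question (which has the longer ATTRIBUTES value). This is only a sketch: the name command_to_hash is invented here, and it splits each item at the first comma with split instead of the regex used above. Note that quotewords also treats single quotes and backslashes specially, so it may not suit input that contains those characters literally.

```perl
use strict;
use warnings;
use 5.010;
use Text::ParseWords qw(quotewords);

# Hypothetical wrapper: turn one command string into a hash reference.
sub command_to_hash {
    my ($command) = @_;
    my %hash;
    # Split on ':' outside quotes; the second argument (keep = 0) strips the quotes.
    for my $item (quotewords(':', 0, $command)) {
        # Split at the first comma only, so commas inside the value survive.
        my ($key, $value) = split /,/, $item, 2;
        $hash{$key} = $value // '';    # keys without a value get the empty string
    }
    return \%hash;
}

my $hash = command_to_hash(
    'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,MOB,123,KEY,VALUE":TIME,"08:01:59":FIN,0'
);
printf "%-10s => '%s'\n", $_, $hash->{$_} for sort keys %$hash;
```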

score 0 · Accepted Answer

my %hash = $command =~ /([^:,]+)(?:,((?:[^:"]|"[^"]*")*))?/g;
s/"([^"]*)"/$1/g
   for grep defined, values %hash;
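This answer was posted without commentary, so here is one reading of it as a runnable sketch. The sub name parse_command and the annotations are additions for illustration (the // operator used in the printing loop needs Perl 5.10 or later). The first capture is the key (everything up to a : or ,); the optional second capture is the value, built from characters that are neither : nor ", or from whole "quoted" chunks. Unlike the other answers, keys with no value come out as undef rather than the empty string.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical wrapper around the two statements above.
sub parse_command {
    my ($command) = @_;

    # With /g in list context, each match contributes (key, value) to the list;
    # value is undef when the optional ",value" part is absent.
    my %hash = $command =~ /([^:,]+)(?:,((?:[^:"]|"[^"]*")*))?/g;

    # values() and grep both return aliases, so s/// edits the hash in place.
    s/"([^"]*)"/$1/g for grep defined, values %hash;

    return \%hash;
}

my $hash = parse_command(
    'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0'
);
for my $key (sort keys %$hash) {
    my $value = $hash->{$key} // '(undef)';
    say "$key => $value";
}
```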
