perl - Parsing data with multiple delimiters in Perl

Question

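I have a mixed-delimiter file with a header row that I'm trying to read with Text::CSV. I have used that module successfully on comma-only files to pull them into an array of hashes in other scripts. I have read that Text::CSV doesn't support multiple delimiters (spaces, tabs, commas), so I've been trying to clean up each line with regexes before handing it to Text::CSV. On top of that, the data file has comment lines in the middle of the data. Unfortunately, I don't have admin rights to install libraries that accommodate multiple sep_chars, so I was hoping I could use Text::CSV or some other standard method to clean up the header and rows before adding them to an AoH. Or should I abandon Text::CSV?

I'm obviously still learning. Thanks in advance.

Sample file: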

#
#
#
# name scale     address      type
test.data.one   32768       0x1234fde0      float
test.data.two   32768               0x1234fde4      float
test.data.the   32768       0x1234fde8      float
# comment lines in middle of data
test.data.for   32768                 0x1234fdec      float
test.data.fiv   32768       0x1234fdf0      float

Code excerpt:

my $fh;
my $input;
my $header;
my $pkey;
my $row;
my %arrayofhashes;   

my $csv=Text::CSV({sep_char = ","})
    or die "Text::CSV error: " Text::CSV=error_diag;

open($fh, '<:encoding(UTF-8)', $input)
    or die "Can't open $input: $!";

while (<$fh>) {
    $line = $_;
    # skip to header row
    next if($line !~ /^# name/);
    # strip off leading chars on first column name
    $header =~ s/# //g;
    # replace multiple spaces and tabs with comma
    $header =~ s/ +/,/g;
    $header =~ s/t+/,/g;
    # results in $header = "name,scale,address,type"
    last;
}

my @header = split(",", $header);
$csv->parse($header);
$csv->column_names([$csv->fields]);
# above seems to work!

$pkey = 0;
while (<$fh>) {
    $line = $_;
    # skip comment lines
    next if ($line =~ /^#/);
    # replace spaces and tabs with commas
    $line =~ s/( +|\t+)/,/g;
    # replace multiple commas from previous regex with single comma    
    $line =~ s/,+/,/g;
    # results in $line = "test.data.one,32768,0x1234fdec,float"

    # need help trying to create a what I think needs to be a hash from the header and row.
    $row = ?????;
    # the following line works in my other perl scripts for CSV files when using:
    # while ($row = $csv->getline_hr($fh)) instead of the above.  
    $arrayofhashes{$pkey} = $row;
    $pkey++;
}

score 2 · Accepted Answer

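If your columns are separated by runs of whitespace, Text::CSV is useless here. Your code contains a lot of repetitive code that tries to work around Text::CSV's limitations.

Your code also has poor style, contains several syntax errors and typos, and uses confusing variable names.

So you want to parse a header.

We need a definition of "header line" for our code. Let's take "the first comment line that contains non-whitespace characters". It must not be preceded by any non-comment line.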

use strict; use warnings; use autodie;

open my $fh, '<:encoding(UTF-8)', "filename.tsv";  # error handling by autodie

my @headers;
while (<$fh>) {
  # no need to copy to a $line variable, the $_ is just fine.
  chomp;                                     # remove line ending
  s/\A#\s*// or die "No header line found";  # remove comment char, or die
  /\S/ or next;                              # skip if there is nothing here
  @headers = split;                          # split the header names.
                                             # A bare `split` means `split ' ', $_` (split on whitespace runs)
  last;                                      # break out of the loop: the header was found
}

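The character class `\s` matches whitespace characters (spaces, tabs, newlines, etc.). `\S` is its inverse and matches every non-whitespace character.

The rest

Now that we have the header names, we can do the normal parsing: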
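
To make the whitespace handling concrete, here is a small sketch of how `split` treats a line that mixes spaces and tabs. The input string is a made-up variation of the sample data (with leading whitespace added on purpose), not a line from the real file.

```perl
use strict;
use warnings;

# Hypothetical input: leading spaces, then columns separated by
# a mix of spaces and tabs, ending with a newline.
my $line = "  test.data.two   32768\t\t0x1234fde4 \t float\n";

# split ' ' (what a bare `split` uses) skips leading whitespace
# and splits on every run of whitespace.
my @awk_style = split ' ', $line;

# split /\s+/ also splits on whitespace runs, but keeps an empty
# leading field when the string starts with whitespace.
my @regex_style = split /\s+/, $line;

print scalar(@awk_style),   " fields: @awk_style\n";
print scalar(@regex_style), " fields (the first one is empty)\n";
```

Either form copes with any mix of spaces and tabs, so no clean-up substitutions are needed before splitting.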

my @records;
while (<$fh>) {
  chomp;
  next if /\A#/;              # skip comments
  my @fields = split;
  my %hash;
  @hash{@headers} = @fields;  # use hash slice to assign fields to headers
  push @records, \%hash;      # add this hashref to our records
}

Voilà.

Result

The parsing code above produces the following data structure from your sample data:
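
The hash slice is the step that pairs each header with its field. The following sketch isolates it, using header names and one row taken from the sample data; `%by_loop` is only there to show the equivalent explicit loop.

```perl
use strict;
use warnings;

my @headers = qw(name scale address type);
my @fields  = ('test.data.one', 32768, '0x1234fde0', 'float');

# Hash slice: assigns $fields[0] to $hash{name}, $fields[1] to
# $hash{scale}, and so on, in one statement.
my %hash;
@hash{@headers} = @fields;

# The same assignment written as an explicit loop, for comparison.
my %by_loop;
$by_loop{ $headers[$_] } = $fields[$_] for 0 .. $#headers;

print "$_ => $hash{$_}\n" for sort keys %hash;
```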

@records = (
  {
    address => "0x1234fde0",
    name    => "test.data.one",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fde4",
    name    => "test.data.two",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fde8",
    name    => "test.data.the",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fdec",
    name    => "test.data.for",
    scale   => 32768,
    type    => "float",
  },
  {
    address => "0x1234fdf0",
    name    => "test.data.fiv",
    scale   => 32768,
    type    => "float",
  },
);

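This data structure can be used like this:

use feature 'say';  # `say` has to be enabled explicitly (or via `use v5.10;`)

for my $record (@records) {
  say $record->{name};
}

or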

for my $i (0 .. $#records) {
  say "$i: $records[$i]{name}";
}
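
For experimenting without a data file, the two loops can be combined into one self-contained sketch that reads from an in-memory filehandle. The helper name `parse_table` and the blank-line check are additions for illustration, and the embedded data is a shortened copy of the sample file.

```perl
use strict;
use warnings;

# Shortened copy of the sample data, kept in a string so that
# the sketch needs no external file. "\t" becomes a real tab.
my $data = <<"END";
#
#
# name scale     address      type
test.data.one   32768       0x1234fde0      float
test.data.two   32768\t\t0x1234fde4      float
# comment lines in middle of data
test.data.for   32768                 0x1234fdec      float
END

# parse_table is a hypothetical helper name, not part of any module.
sub parse_table {
    my ($fh) = @_;

    # The header is the first comment line with non-whitespace content.
    my @headers;
    while (<$fh>) {
        chomp;
        s/\A#\s*// or die "No header line found";
        next unless /\S/;
        @headers = split;
        last;
    }

    my @records;
    while (<$fh>) {
        chomp;
        next if /\A#/;        # skip comments
        next unless /\S/;     # skip blank lines (an addition, not in the answer)
        my %record;
        @record{@headers} = split;
        push @records, \%record;
    }
    return \@records;
}

open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my $records = parse_table($fh);
close $fh;

printf "%s at %s\n", $_->{name}, $_->{address} for @$records;
```

Returning an array reference from the helper keeps `@headers` and `@records` out of the outer scope entirely.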

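Critique of your code

- You declare all your variables at the top of your script, effectively making them global variables. Don't. Create variables in the smallest scope possible. My code uses only three variables in the outer scope: `$fh`, `@headers` and `@records`.
- The line `my $csv=Text::CSV({sep_char = ","})` does not work as intended.
  - `Text::CSV` is not a function; it is the name of a module. You meant `Text::CSV->new(...)`.
  - The options should be a hashref, but `sep_char = ","` tries to assign something to `sep_char` instead of pairing a key with a value. Use the `=>` operator (called the fat comma or hash rocket) instead.
- This does not work either: `or die "Text::CSV error: " Text::CSV=error_diag`.
  - To concatenate strings, use the `.` concatenation operator. What you wrote is a syntax error: a string literal must be followed by an operator.
  - You really like assignments, don't you? `Text::CSV=error_diag` does not work. You intended to call the `error_diag` method on the `Text::CSV` class, so use the correct operator, `->`: `Text::CSV->error_diag`.
- The substitution `s/t+/,/g` replaces every sequence of the letter `t` with a comma. To replace tabs, use the `\t` escape.
- `%arrayofhashes` is not an array of hashes: it is a hash (as the `%` sigil shows), but you use integers as keys. Arrays have the `@` sigil.
- To add something to the end of an array, I would rather not keep the index of the last item in an extra variable. Instead, use the `push` function to append the item. This reduces the amount of bookkeeping code.
- If you find yourself writing a loop like `my $i = 0; while (condition) { do stuff; $i++ }`, then you usually want a C-style `for` loop: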
```
for (my $i = 0; condition; $i++) {
  do stuff;
}
```
This also helps to scope the variable properly.
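
If you did want to stay with Text::CSV, the corrections from this critique could be combined roughly as in the sketch below. The sample line is a hypothetical row with mixed spaces and tabs, and the `eval { require ... }` guard with its `split` fallback is an assumption added only so that the sketch still runs where Text::CSV is not installed.

```perl
use strict;
use warnings;

# Hypothetical data line with both spaces and tabs between columns.
my $line = "test.data.two   32768\t\t0x1234fde4      float";

# Replace each run of spaces and/or tabs with a single comma.
# Note the \t escape: a bare `t` would match the letter t.
(my $csv_line = $line) =~ s/[ \t]+/,/g;

my @fields;
if (eval { require Text::CSV; 1 }) {
    # Text::CSV->new takes a hashref; => pairs the key with its value.
    my $csv = Text::CSV->new({ sep_char => "," })
        or die "Text::CSV error: " . Text::CSV->error_diag;
    $csv->parse($csv_line)
        or die "Parse error: " . $csv->error_diag;
    @fields = $csv->fields;
}
else {
    # Fallback for systems without Text::CSV (not part of the answer).
    @fields = split /,/, $csv_line;
}

print join("|", @fields), "\n";
```

The plain `split` shown earlier in the answer remains the simpler choice for this file, since it needs neither the substitution nor the module.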
