perl - Parsing a CSV file, finding columns and remembering them

Question

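I'm trying to work out a way to do this, and I know it should be possible. First, a little background.

I want to automate the process of creating NCBI Sequin blocks for submitting DNA sequences to GenBank. I always end up building a table that lists the species name, the specimen ID, the sequence type, and finally the collection locality. It's easy for me to export this to a tab-delimited file. Right now I do something like this: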

# CSV and NEWFILE are filehandles opened elsewhere.
while (my $line = <CSV>) {
  chomp $line;
  # Skip the table title and the header row.
  next if $line =~ m/table|species|accession/i;
  my @csv = split /\t/, $line;
  print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
}

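I know this is messy; I just typed something resembling what I remember (I don't have the script on any of my computers at home, only at work).

This works fine for me at the moment, because I know which columns hold the information I need (species, locality, and ID number).

But is there a way (there must be) to find the columns holding the required information dynamically? That is, no matter what order the columns are in, the right information from the right column ends up in the right place?

The first line is usually Table X (where X is the number of the table in the publication), the next line usually holds the column headings of interest, and the headings are nearly universal. Almost all tables use standard headings that I can search for, and I can use | in my pattern match.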

score 3 · Accepted Answer

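First, I would be remiss if I didn't recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and it can even handle the column-mapping scheme that Barmar describes in the other answer.

That said, Barmar has the right approach, although it completely ignores the fact that the "Table X" line is a separate line. I would recommend an explicit approach, perhaps something like this (it is more detailed than necessary to keep things clear; I would probably write it more tightly in production code):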

# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.

my $hdr_data = {};

while( <$csv_fh> ) {
  chomp;  # Otherwise the last heading keeps its trailing newline.
  if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
    $hdr_data->{'table'} = $1;
    next;
  }
  # Match case-insensitively, since the headings are lower-cased below.
  if( ! $hdr_data->{'species'} && /species/i ) {
    my $n = 0;
    # Takes the column headers as they come, creating
    # a map between the column name and column number.
    # Assumes that column names are case-insensitively
    # unique.
    my %columns = map { lc($_) => $n++ } split( /\t/ );
    # Now pick out exactly the columns we want.
    foreach my $thingy ( qw{ species accession country } ) {
      $hdr_data->{$thingy} = $columns{$thingy};
    }
    last;
  }
}

# Now process the rest of the lines.

while( <$csv_fh> ) {
  chomp;
  # split must be assigned to an array; in scalar context it returns a field count.
  my @col = split( /\t/ );
  printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
    $col[$hdr_data->{'species'}],
    $col[$hdr_data->{'country'}],
    $col[$hdr_data->{'accession'}];
}

With a few variations, this should get you close to what you need.
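For anyone who wants to try the idea without a file at hand, the two phases can be folded into one routine. This is only a sketch under some assumptions: the sub name `sequin_headers`, the `%ALIASES` table, and the alternative heading spellings in it are invented for illustration, the input is a list of lines rather than a filehandle, and the header row is taken to be the first line that contains all three wanted headings (so a "Table X" title line is skipped without special handling).

```perl
use strict;
use warnings;

# Hypothetical heading aliases: each wanted field maps to a pattern of
# spellings that might appear in a published table. Adjust as needed.
my %ALIASES = (
    species   => qr/^(?:species|taxon)$/i,
    accession => qr/^(?:accession|specimen id|voucher)$/i,
    country   => qr/^(?:country|locality|location)$/i,
);

# Takes tab-delimited lines; returns one Sequin-style definition line
# per data row, regardless of column order.
sub sequin_headers {
    my @lines = @_;
    my %index;    # field name => column number, filled from the header row
    my @out;

    for my $line (@lines) {
        $line =~ s/\r?\n\z//;        # strip the line ending (LF or CRLF)
        next if $line =~ /^\s*$/;    # skip blank lines
        my @col = split /\t/, $line, -1;    # -1 keeps trailing empty fields

        if ( !%index ) {
            # Still looking for the header row: record which wanted
            # headings appear on this line, and in which columns.
            my %found;
            for my $i ( 0 .. $#col ) {
                for my $field ( keys %ALIASES ) {
                    $found{$field} = $i
                        if !exists $found{$field} && $col[$i] =~ $ALIASES{$field};
                }
            }
            # Accept the line as the header only if every field was found.
            %index = %found if keys(%found) == keys(%ALIASES);
            next;
        }

        push @out, sprintf(
            ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]",
            $col[ $index{species} ],
            $col[ $index{country} ],
            $col[ $index{accession} ],
        );
    }

    die "No header row found\n" unless %index;
    return @out;
}
```

With a real file, something like `print NEWFILE "$_\n" for sequin_headers(<$csv_fh>);` would write the lines out, at the cost of reading the whole file into memory first.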

score 1 · Accepted Answer

Create a hash that maps column headings to column numbers:

my %columns;
...

if (/table|species|accession/i) {
  chomp;
  my @headings = split(/\t/);
  my $col = 0;
  # Use a separate loop variable so the counter $col is not shadowed.
  foreach my $heading (@headings) {
    $columns{"\L$heading"} = $col++;
  }
}

You can then use $csv[$columns{'species'}].
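As a small illustration of that lookup, the map can also be built with a hash slice and the wanted fields pulled out of a row with an array slice. This is a sketch only: the sub names `column_map` and `pick_fields` and the sample header and row are made up for the example.

```perl
use strict;
use warnings;

# Build a lower-cased heading => column-number map from a header line.
sub column_map {
    my ($header_line) = @_;
    chomp $header_line;
    my @headings = split /\t/, $header_line;
    my %columns;
    @columns{ map { lc $_ } @headings } = 0 .. $#headings;    # hash slice
    return %columns;
}

# Return the requested fields of one row, in the order asked for.
sub pick_fields {
    my ( $columns, $row_line, @wanted ) = @_;
    chomp $row_line;
    my @csv = split /\t/, $row_line, -1;
    return @csv[ @{$columns}{@wanted} ];    # array slice indexed by a hash slice
}

# Made-up sample data, with the columns in a different order than expected.
my %columns = column_map("Country\tSpecies\tAccession\n");
my ( $species, $id ) =
    pick_fields( \%columns, "Peru\tHomo sapiens\tAB123\n", qw(species accession) );
print "$species ($id)\n";
```

The slice form keeps the per-row code to a single line, though a heading missing from the map would yield an undefined index, so checking the map after reading the header is still worthwhile.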
