bash - Parsing CSV files using gawk

Question

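How do I parse a CSV file using gawk? Simply setting FS="," is not enough, because a quoted field that contains a comma is then treated as multiple fields.

Example using FS="," which does not work:

File contents: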

one,two,"three, four",five
"six, seven",eight,"nine"

gawk script:

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
  printf "---------------------------\n"
}

Bad output:

field #1: one
field #2: two
field #3: "three
field #4:  four"
field #5: five
---------------------------
field #1: "six
field #2:  seven"
field #3: eight
field #4: "nine"
---------------------------

Desired output:

field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------

score 15 · Accepted Answer

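The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"

When FPAT is defined, it disables FS and specifies fields by their content instead of by the separator between them.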

score 13 · Accepted Answer

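The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where "awkward" means things like commas inside the CSV field data.

The next question is "What other processing are you going to be doing?", since that will influence which alternative you use.

I would probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer, hence the a2p and s2p programs still distributed with Perl, which convert awk and sed scripts (respectively) into Perl.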

score 4 · Accepted Answer

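If it is allowed, I would use the Python csv module to parse the CSV file you have, paying special attention to the dialect used and the formatting parameters required.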

score 4 · Accepted Answer

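You can use a simple wrapper called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and at the end, and everything should work:

Before:

gawk -f myprogram.awk input.csv

After:

csvquote input.csv | gawk -f myprogram.awk | csvquote -u

See https://github.com/dbro/csvquote for code and documentation.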

score 2 · Accepted Answer

csv2delim.awk

# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
#     delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
#     repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '

# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=`"] input-file > output-file
#       -v delim    delimiter, defaults to tab
#       -v repl     replacement char, defaults to ~

# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt

# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present

BEGIN {
    if (delim == "") delim = "\t"
    if (repl == "") repl = "~"
    print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}

{
    #if ($0 ~ repl) {
    #   print "Replacement character " repl " is on line " FNR ":" lineIn ";" > "/dev/stderr"
    #}
    if ($0 ~ delim) {
        print "Temp delimiter character " delim " is on line " FNR ":" $0 ";" > "/dev/stderr"
        print "    replaced by " repl > "/dev/stderr"
    }
    gsub(delim, repl)

    $0 = gensub(/([^,])\"\"/, "\\1'", "g")
#   $0 = gensub(/\"\"([^,])/, "'\\1", "g")  # not needed above covers all cases

    out = ""
    #for (i = 1;  i <= length($0);  i++)
    n = length($0)
    for (i = 1;  i <= n;  i++)
        if ((ch = substr($0, i, 1)) == "\"")
            inString = (inString) ? 0 : 1 # toggle inString
        else
            out = out ((ch == "," && ! inString) ? delim : ch)
    print out
}

END {
    print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}

test.csv

"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec   ond,"third"
"first" , "second","th  ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3

test.bat

rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk                      -f csv2delim.awk test.csv > test.txt
gawk -v delim=;           -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk            -v repl=` -f csv2delim.awk test.csv > testr.txt

score 1 · Accepted Answer

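I am not sure this is the right way to do things. I would rather work on a CSV file in which either all values are quoted or none are. By the way, awk allows the field separator to be a regular expression. Check whether that is useful.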
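
To illustrate the regular-expression separator mentioned above, here is a small sketch in which FS matches a comma together with any spaces around it. The sample line and the variable out are made up for the example. A regex FS still splits on commas inside quoted fields, so on its own it does not solve the quoting problem from the question.

```shell
# FS is a regular expression here: a comma plus any spaces around it.
out=$(printf '%s\n' 'alpha , beta,gamma ,  delta' | awk -F ' *, *' '
{
  for (i = 1; i <= NF; i++) printf "field #%d: [%s]\n", i, $i
}')

printf '%s\n' "$out"
```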

score 1 · Accepted Answer

{
  ColumnCount = 0
  $0 = $0 ","                           # Assures all fields end with comma
  while($0)                             # Get fields by pattern, not by delimiter
  {
    match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
    Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
    gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
    Column[++ColumnCount] = Field       # Save field without delimiter in an array
    $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
  }
}

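Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements found in Column[]. If not all rows contain the same number of columns, Column[] still holds leftover data after Column[ColumnCount] while shorter rows are being processed.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature of gawk >= 4.0.0 mentioned in a previous answer.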

score 0 · Accepted Answer

This is what I came up with. Any comments and/or better solutions would be appreciated.

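BEGIN { FS="," }
{
  n = 0   # reset the merged-field count for every record
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1)=="\"") {
      # keep appending pieces until the field ends in an unescaped quote,
      # but never read past the last field of the record
      while (i<NF && (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\")) {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------\n"
}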

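The basic idea is that I loop through the fields, and any field that begins with a quote but does not end with one gets the next field appended to it.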

score 0 · Accepted Answer

Perl has the Text::CSV_XS module, which is purpose-built to handle the quoted-comma weirdness.
Alternatively, try the Text::CSV module.

perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv

This produces the following output:

field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---

Here is a human-readable version.
Save it as parsecsv, chmod +x it, and run it as "parsecsv file.csv"

#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
    if ($csv->parse($line)) {
        my @f = $csv->fields();
        for my $n (0..$#f) {
            print "field #$n: $f[$n]\n";
        }
        print "---\n";
    }
}

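You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed for your default perl. In that case you will see an error like this: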

Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.

If none of your versions of Perl have Text::CSV_XS installed, you will need to run:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS
