
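I have a .CSV file (call it tab_delimited_file.csv) that I downloaded from a particular vendor's portal. When I moved the file to one of my Linux directories, I noticed that it is actually a tab-delimited file that is merely named .CSV. A few sample records from the file are shown below.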

"""column1"""   """column2"""   """column3"""   """column4"""   """column5"""   """column6"""   """column7"""  
12  455 string with quotes, and with a comma in between 4432    6787    890 88  
4432    6787    another, string with quotes, and with two comma in between  890 88  12  455  
11  22  simple string   77  777 333 22

The sample records above are separated by tabs. I know the header of the file looks odd, but this is the format in which I receive the file.

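I tried using the tr command to replace the tabs with commas, but the file ends up completely broken because of the extra commas inside the record values. I need the record values that contain commas to be enclosed in double quotes. The command I used is below.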

tr '\t' ',' < tab_delimited_file.csv > comma_separated_file.csv    

This converts the file to the following format.

"""column1""","""column2""","""column3""","""column4""","""column5""","""column6""","""column7"""
12,455,string with quotes, and with a comma in between,4432,6787,890,88
4432,6787,another, string with quotes, and with two comma in between,890,88,12,455
11,22,simple string,77,777,333,22
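
The breakage can be made visible by counting fields before and after the conversion. The following is a minimal sketch, assuming a hypothetical one-record sample built with printf so that the tabs are real; the variable names are illustrative only.

```shell
# Hypothetical one-record sample; \t produces real tab characters.
record=$(printf '12\t455\tstring with quotes, and with a comma in between\t4432\t6787\t890\t88')

# Field count when splitting on tabs (the real structure of the record).
tab_fields=$(printf '%s\n' "$record" | awk -F'\t' '{ print NF }')

# Field count after the blind tr conversion, when splitting on commas.
comma_fields=$(printf '%s\n' "$record" | tr '\t' ',' | awk -F, '{ print NF }')

echo "tab-separated fields:   $tab_fields"
echo "comma-separated fields: $comma_fields"
```

The record has 7 fields when split on tabs but 8 when split on commas after tr, because the comma inside the text value is indistinguishable from a separator once it is no longer quoted.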

I need help converting the sample file to the following format.

column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22

Any solution using sed or awk would be very helpful.


2 Answers


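This produces the output you asked for, but it is not clear whether the conditions I assumed, such as which fields get put in quotes (any field containing a comma or whitespace), are what you actually want, so test it yourself against other input: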

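$ awk 'BEGIN { FS=OFS="\t" }      # split on tabs; rebuild records with tabs for now
  {
     gsub(/"/,"")                 # strip every double quote from the record
     for (i=1;i<=NF;i++)
         if ($i ~ /[,[:space:]]/) # field contains a comma or whitespace
             $i = "\"" $i "\""    # so wrap it in double quotes
     gsub(OFS,",")                # turn the tab separators into commas
     print
  }
  ' file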
column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22
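
To test the command against other input, it helps to build the test file with printf so that the separators are real tabs. The following is a minimal sketch under stated assumptions: the sample is shortened to four columns, the last row is a hypothetical extra case (a comma without whitespace, plus an empty field), and the `input` and `output` variable names are illustrative.

```shell
# Build a small test input with real tabs. The last row is a hypothetical
# extra case: a comma without whitespace, followed by an empty field.
input=$(mktemp)
{
  printf '"""column1"""\t"""column2"""\t"""column3"""\t"""column4"""\n'
  printf '12\t455\tstring with quotes, and a comma\t4432\n'
  printf '11\t22\tsimple string\t77\n'
  printf '1\ta,b\t\t4\n'
} > "$input"

# Same awk program as above, with its result captured for inspection.
output=$(awk 'BEGIN { FS=OFS="\t" }
  {
     gsub(/"/,"")
     for (i=1;i<=NF;i++)
         if ($i ~ /[,[:space:]]/)
             $i = "\"" $i "\""
     gsub(OFS,",")
     print
  }' "$input")

printf '%s\n' "$output"
rm -f "$input"
```

With an awk that supports POSIX character classes such as `[[:space:]]`, the last row should come out as `1,"a,b",,4`: the comma-only value is quoted and the empty field is preserved unquoted.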
answered 2013-10-02T16:41:58.453


One way using awk:

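awk '
    BEGIN { FS = "\t"; OFS = "," }
    # Header line: strip all double quotes from every field.
    FNR == 1 {
        for ( i = 1; i <= NF; i++ ) { gsub( /"+/, "", $i ) }
        $1 = $1    # force the record to be rebuilt with OFS even if nothing changed
        print $0
        next
    }
    # Data lines: quote any field made of more than one whitespace-separated word.
    FNR > 1 {
        for ( i = 1; i <= NF; i++ ) {
            w = split( $i, _, " " )
            if ( w > 1 ) { $i = "\"" $i "\"" }
        }
        $1 = $1    # force the record to be rebuilt with OFS even if nothing changed
        print $0
    }
' infile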

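It uses tabs to split the fields of the input and commas to write the output. The header is easy: just remove all double quotes. For the data lines, each field is split on whitespace and is wrapped in double quotes only if the split returns more than one part.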
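
The quoting test can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper named `count_words` that prints what `split()` returns for a single value; with `" "` as the separator, awk splits on runs of blanks and ignores leading and trailing ones.

```shell
# Hypothetical helper: print the number of whitespace-separated words
# that awk split() finds in the value passed as the first argument.
count_words() {
  awk -v s="$1" 'BEGIN { print split(s, parts, " ") }'
}

count_words 'simple string'   # more than one word: the field would be quoted
count_words '4432'            # a single word: the field is left as is
count_words 'a,b'             # a single word too, despite the comma
```

The last call shows a consequence of this design: a value that contains a comma but no whitespace counts as one word, so it would not be quoted by this test.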

It yields:

column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22
answered 2013-10-02T16:19:27.413