regex - sed - Removing quotes within quotes in a large CSV file

Question

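I am using the stream editor sed to convert a large set of text file data (400MB) into CSV format.

I am very close to finishing, but the outstanding problem is quotes within quotes, on data like this: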

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"

The desired output is:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

I have searched around for help, but I have not gotten close to a solution. I have tried the following sed commands with regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

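These came from the questions below, but they do not seem to work with sed:

Related question for Perl

Related question for SISS

The original files are *.txt, and I am trying to edit them with sed.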
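
For context on why the two commands above fail: sed uses POSIX regular expressions, which have no lookbehind `(?<=...)` or lookahead `(?=...)`. The following is a minimal sketch of how the intent of the second pattern might be approximated in sed. It uses capture groups for the neighbouring characters, plus a loop, because each match consumes those characters. The function name `drop_inner_quotes` and the sample line are hypothetical, and the heuristic assumes an inner quote never sits directly next to a comma.

```shell
# Hypothetical helper: delete every double quote that has a non-comma
# character on both sides. Quotes at the start or end of a line, or next
# to a comma, are treated as field delimiters and kept.
drop_inner_quotes() {
  sed -e ':a' \
      -e 's/\([^,]\)"\([^,]\)/\1\2/g' \
      -e 'ta'
}

# Made-up sample line with one nested pair of quotes in the second field.
printf '%s\n' 'x,"a "b" c",y' | drop_inner_quotes
```

The loop matters because a match such as ` "b` consumes the `b`, so the quote right after it can only be removed on the next pass.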

score 2 · Accepted Answer

Here is one way using GNU awk with the FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file

Results:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

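Explanation:

With FPAT, a field is defined as either "anything that is not a comma" or "a double quote, anything that is not a double quote, and a closing double quote". Then, for each line of input, loop over every field; if a field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes back around the field.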

score 1 · Accepted Answer

sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE

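This looks for strings of the form "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". If it finds something, it repeats, to make sure it eliminates all nested strings of depth > 2.

It also makes sure that no STRx contains a comma.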
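
A minimal sketch of this loop applied to the third sample line from the question. The label, substitution, and branch are split into separate `-e` expressions, which is an assumption made here for portability across sed implementations. `strip_nested` is a hypothetical name.

```shell
# Hypothetical wrapper around the loop: collapse "STR1 "STR2" STR3" into
# "STR1 STR2 STR3" and repeat until no four-quote group without commas remains.
strip_nested() {
  sed -e ':r' \
      -e 's/"\([^",]*\)"\([^",]*\)"\([^",]*\)"/"\1\2\3"/' \
      -e 'tr'
}

# Third sample line from the question: the nested quotes around word3 are removed.
printf '%s\n' '3,word3,"description for "word3"","another text","more text and more"' | strip_nested
```

The last field of the first sample line holds five quotes rather than four, so it does not fit this shape exactly and may come out only partly cleaned.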
