Suppose we have a comma-separated (CSV) file like this:

"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The "day" when earth stood still","Michael Rennie,the 'strong' man","robert wise","1951"
"the 'gladiator'","russel "the awesome" crowe","ridley scott","2000"

As you can see, lines 4 and 5 contain quotes nested inside the quoted fields. The output should look like this:

"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The day when earth stood still","Michael Rennie,the strong man","robert wise","1951"
"the gladiator","russel the awesome crowe","ridley scott","2000"

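How can I get rid of quotes (both single and double) that appear inside quoted fields in a CSV file like this? Note that commas within a single field are fine, because the parser recognizes that they are inside quotes and treats the field as one. This is only a preprocessing step to tidy the CSV file so that it can be fed to several parsers and converted to whatever format we want. Bash, awk, or Python would all be fine. Please, no Perl, I'm fed up with that language :D Thanks in advance!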

3 Answers

How about:

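import csv

def remove_quotes(s):
    # Drop every single and double quote character from a field value.
    return ''.join(c for c in s if c not in ('"', "'"))

# In Python 3 the csv module expects text-mode files opened with newline="".
with open("fixquote.csv", newline="") as infile, open("fixed.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    # QUOTE_ALL wraps every output field in double quotes.
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow([remove_quotes(elem) for elem in line])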

which produces

~/coding$ cat fixed.csv 
"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The day when earth stood still","Michael Rennie,the strong man","robert wise","1951"
"the gladiator","russel the awesome crowe","ridley scott","2000"

By the way, you may want to check the spelling of some of those names.
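
As far as I can tell, the csv-based approach copes with these lines because Python's csv reader is lenient by default (strict=False): a quote followed by ordinary text inside a quoted field does not raise an error, although stray quote characters may end up in the parsed value. A minimal sketch to check this on one record held in memory (the names sample, row and cleaned are only illustrative):

```python
import csv
import io

# One malformed record from the question, held in memory for illustration.
sample = '"the \'gladiator\'","russel "the awesome" crowe","ridley scott","2000"\n'

# The default dialect is lenient (strict=False): a quote followed by ordinary
# text inside a quoted field does not raise an error.
row = next(csv.reader(io.StringIO(sample)))

# Stray quote characters may survive in the parsed values, so strip them.
cleaned = [value.replace('"', "").replace("'", "") for value in row]

print(len(row))   # number of fields the reader found
print(cleaned)    # field values with all quote characters removed
```

One caveat worth testing on real data: if a nested quoted part is followed by a comma within the same field, the reader will probably split the field there, so it is worth checking that each row still has the expected number of fields.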

answered 2012-08-17T17:58:23.070

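With awk you can do the following, splitting on the quote-comma-quote sequence so that commas inside a field are left untouched:

awk -F'","' -v OFS='","' -v Q='"' -v S="'" '{ sub(/^"/, ""); sub(/"$/, ""); for (i = 1; i <= NF; i++) gsub("[" Q S "]", "", $i); print Q $0 Q }' fixquote.csv

answered 2012-08-17T18:08:14.830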

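Split each line into an array of values. Loop over the array and, for each value, remove every quote except the first and last characters. Hope that helps.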
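
The idea above could be sketched roughly as follows in Python. This assumes that every field is wrapped in double quotes and that the quote-comma-quote sequence never occurs inside a field; the function name strip_inner_quotes is hypothetical.

```python
def strip_inner_quotes(line):
    """Return the line with quotes kept only around each field."""
    # Drop the line ending, then the outermost pair of double quotes.
    body = line.rstrip("\r\n")
    if len(body) >= 2 and body.startswith('"') and body.endswith('"'):
        body = body[1:-1]

    # Split on the quote-comma-quote boundary between fields, so commas
    # inside a field are left alone.
    fields = body.split('","')

    # Remove any remaining single or double quotes inside each field.
    cleaned = [f.replace('"', "").replace("'", "") for f in fields]

    # Re-wrap every field in double quotes.
    return '"' + '","'.join(cleaned) + '"'


if __name__ == "__main__":
    example = '"the \'gladiator\'","russel "the awesome" crowe","ridley scott","2000"\n'
    print(strip_inner_quotes(example))
```

Applying the function to each line of the input file and writing the results out would give the cleaned file; unlike a full CSV parser, it does not handle fields that span several lines.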

answered 2012-08-17T17:52:35.960