python - Regex to remove double quotes from a CSV

Question

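I have an Excel sheet with a lot of data in one column, in the form of Python dictionaries that came from a SQL database. I don't have access to the original database, and I can't import the CSV into SQL with the local infile command because the keys/values on each line of the CSV are in a different order. When I export the Excel sheet to CSV I get: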

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

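What is the best way to remove the " before and after the curly brackets, as well as the extra " around the keys/values?

I also need to leave the integers, which have no quotes, alone.

I am trying to import this into Python with the json module so that I can print specific keys, but I can't import the lines while they contain the doubled double quotes. I ultimately need the data saved in a file that looks like this: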

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Any help is much appreciated!

score 2 · Accepted Answer

Easy:

`text = re.sub(r'"(?!")', '', text)`

Given the input file TEST.TXT:

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

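The script:

import re

# Read the whole file; the with block closes it afterwards.
with open("TEST.TXT", "r") as f:
    text_in = f.read()
# Drop every double quote that is not followed by another double quote.
text_out = re.sub(r'"(?!")', '', text_in)
print(text_out)

produces the following output: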

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
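To see why the pattern works, here is a minimal sketch that runs it on the two sample rows from the question and then parses the result with the json module, which was the asker's stated goal. The variable names (`raw`, `cleaned`, `records`) are only illustrative. The negative lookahead `(?!")` keeps one quote out of every doubled pair and removes the lone quotes at the start and end of each row.

```python
import json
import re

# Sample rows as exported from Excel (copied from the question).
raw = (
    '"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"\n'
    '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\n'
)

# '"(?!")' matches a double quote only when the next character is not
# another double quote, so one quote of every doubled pair survives.
cleaned = re.sub(r'"(?!")', '', raw)

# Each cleaned line should now parse as JSON; age stays an integer.
records = [json.loads(line) for line in cleaned.splitlines()]
print(records[0]["first_name"], records[0]["age"])
```

One caveat with this pattern: an empty string value, which the CSV export writes as four consecutive quotes, would be reduced to three quotes rather than two, so the resulting line would no longer be valid JSON. For data like the sample, where no value is empty, this does not come up.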

score 2 · Accepted Answer

This should do it:

import re

with open('old.csv') as old, open('new.csv', 'w') as new:
    new.writelines(re.sub(r'"(?!")', '', line) for line in old)

score 1 · Accepted Answer

You can actually use the csv module together with a regex to do this:

st='''\
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\
'''

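import csv, re

data=[]
# Passing a string makes csv.reader consume it one character at a time;
# the net effect here is that every double quote is stripped.
reader=csv.reader(st, dialect='excel')
for line in reader:
    data.extend(line)

# Re-quote every word (this also wraps the integers in quotes).
s=re.sub(r'(\w+)',r'"\1"',''.join(data))
# Put each {...} object back on its own line.
s=re.sub(r'({[^}]+})',r'\1\n',s).strip()
print(s)

Prints (note that the integers end up quoted, because the `\w+` substitution wraps every word, numbers included):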

{"first_name":"John","last_name":"Smith","age":"30"}
{"first_name":"Tim","last_name":"Johnson","age":"34"}
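If the integers need to stay unquoted, the csv module can do the whole job without a regex, provided the reader is given lines rather than a bare string. The following is a sketch under that assumption, not the answerer's code; `io.StringIO` stands in for an open file, and the names `json_lines` and `records` are illustrative.

```python
import csv
import io
import json

# Sample rows as exported from Excel (copied from the question).
raw = (
    '"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"\n'
    '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\n'
)

# csv.reader undoes the CSV quoting: the outer quotes are dropped and
# each doubled quote ("") becomes a single one. The commas inside the
# quoted field do not split it, so every row has exactly one column.
rows = csv.reader(io.StringIO(raw))
json_lines = [row[0] for row in rows if row]

# The unquoted lines are plain JSON, so the integers stay integers.
records = [json.loads(line) for line in json_lines]
for line in json_lines:
    print(line)
```

With a real file, `open('old.csv', newline='')` would take the place of `io.StringIO(raw)`.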

score 1 · Accepted Answer

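I think you are overthinking this. Why not just replace the data?

lines = []
with open('foo.txt') as f:
    for line in f:
        # Collapse doubled quotes, then drop the quotes around the braces.
        lines.append(line.replace('""','"').replace('"{','{').replace('}"','}'))
s = ''.join(lines)

print(s) # or save it to file

It produces: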

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

It uses a list to store the intermediate lines and then calls .join for better performance, as explained in "Good way to append to a string".

score 1 · Accepted Answer

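If the input file is as shown, and it is as small as you mention, you can load the whole file into memory, make the replacements, and save it back. IMHO you don't need a regex for this. The easiest code to read is:

with open(filename) as f:
    text = f.read()
# Collapse the doubled quotes, then drop the quotes around the braces.
text = text.replace('""', '"')
text = text.replace('"{', '{')
text = text.replace('}"', '}')
with open(filename, "w") as f:
    f.write(text)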

I tested it with the sample input and it produced:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

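Which is exactly what you want.

If you prefer, you can also pack the code into a single expression and write

with open(inputFilename) as fin:
    with open(outputFilename, "w") as fout:
        fout.write(fin.read().replace('""','"').replace('"{','{').replace('}"','}'))

But I think the first version is clearer, and both do exactly the same thing.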
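The chained replaces assume that `"{` and `}"` only ever occur at the edges of a line. If a string value could itself start with `{` or end with `}`, a slightly more defensive variant is to strip only the outermost pair of quotes and then collapse the doubled ones. This is a sketch, not part of the answer above: `unquote_csv_line` is a hypothetical helper, and the `note` field in the sample is made up to show the edge case.

```python
import json


def unquote_csv_line(line):
    """Undo CSV quoting for a single-column line (hypothetical helper)."""
    line = line.rstrip("\r\n")
    # Strip only the outermost pair of quotes, if present.
    if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
        line = line[1:-1]
    # Collapse the doubled quotes that CSV uses to escape a literal quote.
    return line.replace('""', '"')


# Made-up row whose "note" value starts and ends with braces.
sample = '"{""first_name"":""John"",""note"":""{x}"",""age"":30}"\n'
record = json.loads(unquote_csv_line(sample))
print(record["note"], record["age"])
```

On the two rows from the question this helper gives the same result as the chained replaces.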
