The csv module in Python cannot handle UTF-8/Unicode input correctly. I have found snippets in the Python documentation and on other web pages that work for particular cases, but you have to understand well which encoding you are handling and use the appropriate snippet.

How can I read and write both strings and Unicode strings from a .csv file in a way that "just works" in Python 2.6? Or is this a limitation of Python 2.6 with no simple solution?

The example code for reading Unicode given at http://docs.python.org/library/csv.html#examples looks to be obsolete, as it does not work with Python 2.6 and 2.7.
The UnicodeDictReader below is an approach that works for utf-8 and may work for other encodings, but I have only tested it on utf-8 input.

In short, decode the csv fields only after the row has been split into fields by csv.reader.
import csv

class UnicodeCsvReader(object):
    def __init__(self, f, encoding="utf-8", **kwargs):
        self.csv_reader = csv.reader(f, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next()
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding="utf-8", fieldnames=None, **kwds):
        csv.DictReader.__init__(self, f, fieldnames=fieldnames, **kwds)
        self.reader = UnicodeCsvReader(f, encoding=encoding, **kwds)
Usage (the source file is encoded in utf-8):
csv_lines = (
    "абв,123",
    "где,456",
)

for row in UnicodeCsvReader(csv_lines):
    for col in row:
        print type(col), col
Output:
$ python test.py
<type 'unicode'> абв
<type 'unicode'> 123
<type 'unicode'> где
<type 'unicode'> 456
A bit late with the answer, but I have had great success with unicodecsv.

The module, available here, looks like a cool, simple drop-in replacement for the csv module that lets you work with utf-8 csv.
import ucsv as csv

with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row
There is already a Unicode usage example in that very documentation, so why do you need to find another one or reinvent the wheel?
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
I can confirm that unicodecsv is a great replacement for the csv module. I just replaced csv with unicodecsv in my source code, and it works like a charm.
The wrapper unicode_csv_reader mentioned in the Python documentation accepts Unicode strings. This is because csv does not accept Unicode strings. csv is probably not aware of encodings or locales and just treats the strings it gets as bytes. So what happens is that the wrapper encodes the Unicode strings, meaning that it creates a string of bytes. Then, when the wrapper gives back the results from csv, it decodes the bytes again, converting the UTF-8 byte sequences back to the correct Unicode characters.

If you give the wrapper a plain byte string, e.g. by using f.readlines(), it will raise a UnicodeDecodeError on bytes with a value > 127. You would use the wrapper when you have Unicode strings in your program that are in the CSV format.

I can imagine that the wrapper still has one limitation: since csv does not accept Unicode, and it also does not accept multi-byte delimiters, you cannot parse files that have a Unicode character as the delimiter.
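Incidentally, this delimiter limitation is one more thing Python 3 lifts: its csv module accepts any single Unicode character as the delimiter. A small illustration (not from the original answer; the snowman delimiter is just an arbitrary non-ASCII example):

```python
import csv
import io

# Python 3's csv module takes any one-character (unicode) string as delimiter
data = io.StringIO("абв☃123\nгде☃456\n")
rows = list(csv.reader(data, delimiter="☃"))
print(rows)  # [['абв', '123'], ['где', '456']]
```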
Maybe this is obvious, but for the sake of beginners I will mention it: in Python 3.x the csv module supports any encoding out of the box, so if you use that version you can stick with the standard module.
import csv

with open("foo.csv", encoding="utf-8", newline="") as f:
    r = csv.reader(f, delimiter=";")
    for row in r:
        print(row)
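Writing works the same way in Python 3. A minimal round-trip sketch (foo.csv is just a placeholder name; newline="" is what the csv docs recommend when opening files for csv use):

```python
import csv

# write unicode text with the standard csv module (Python 3)
with open("foo.csv", "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f, delimiter=";")
    w.writerow(["абв", "123"])
    w.writerow(["где", "456"])

# read it back; every cell comes out as a str (unicode) object
with open("foo.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter=";"))

print(rows)  # [['абв', '123'], ['где', '456']]
```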
You should have a look at tablib, which takes a completely different approach, but should be considered under the "just works" requirement.
with open('some.csv', 'rb') as f:
    csv = f.read().decode("utf-8")

import tablib
ds = tablib.Dataset()
ds.csv = csv
for row in ds.dict:
    print row["First name"]
Warning: tablib will reject your csv if it does not have the same number of items on every row.
Here is a slightly improved version of Maxim's answer, which can also skip the UTF-8 BOM:
import csv
import codecs

class UnicodeCsvReader(object):
    def __init__(self, csv_file, encoding='utf-8', **kwargs):
        if encoding == 'utf-8-sig':
            # convert from utf-8-sig (= UTF-8 with BOM) to plain utf-8 (without BOM):
            self.csv_file = codecs.EncodedFile(csv_file, 'utf-8', 'utf-8-sig')
            encoding = 'utf-8'
        else:
            self.csv_file = csv_file
        self.csv_reader = csv.reader(self.csv_file, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next()
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, csv_file, encoding='utf-8', fieldnames=None, **kwds):
        reader = UnicodeCsvReader(csv_file, encoding=encoding, **kwds)
        csv.DictReader.__init__(self, reader.csv_file, fieldnames=fieldnames, **kwds)
        self.reader = reader
Note that the presence of the BOM is not automatically detected. You must signal that it is there by passing the encoding='utf-8-sig' argument to the constructor of UnicodeCsvReader or UnicodeDictReader. The encoding utf-8-sig is utf-8 with a BOM.
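For comparison, on Python 3 none of this wrapping is needed: passing encoding="utf-8-sig" to open() strips the BOM transparently on read. A small sketch (bom.csv is a made-up file name):

```python
import csv

# create a file with a UTF-8 BOM, as Excel does when exporting "CSV UTF-8"
with open("bom.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["абв", "123"])

# utf-8-sig strips the BOM on read; plain utf-8 would leave '\ufeffабв'
with open("bom.csv", encoding="utf-8-sig", newline="") as f:
    first_row = next(csv.reader(f))

print(first_row)  # ['абв', '123']
```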
I would add to that answer. By default, Excel saves csv files as latin-1 (which ucsv does not support). You can easily fix this with:
import codecs
import StringIO
import ucsv

with codecs.open(csv_path, 'rb', 'latin-1') as f:
    f = StringIO.StringIO(f.read().encode('utf-8'))
    reader = ucsv.UnicodeReader(f)
    # etc.
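On Python 3, the same Excel/latin-1 situation needs no re-encoding dance at all; you just declare the encoding when opening the file. A sketch under that assumption (latin1.csv is a hypothetical name standing in for an Excel export):

```python
import csv

# simulate an Excel-style latin-1 export
with open("latin1.csv", "w", encoding="latin-1", newline="") as f:
    csv.writer(f).writerow(["café", "42"])

# read it back by naming the encoding up front
with open("latin1.csv", encoding="latin-1", newline="") as f:
    row = next(csv.reader(f))

print(row)  # ['café', '42']
```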