The csv module in Python cannot handle UTF-8/Unicode input correctly. I have found snippets in the Python documentation and on other web pages that work for particular cases, but you have to understand well which encoding you are handling and use the appropriate snippet.

How can I read and write both strings and Unicode strings from a .csv file in a way that "just works" in Python 2.6? Or is this a limitation of Python 2.6 with no simple solution?

The example code for reading Unicode given at http://docs.python.org/library/csv.html#examples looks to be obsolete, as it does not work with Python 2.6 and 2.7.
The UnicodeDictReader below is an approach that works for utf-8 and may work for other encodings, but I have only tested it on utf-8 input.

In short, decode the csv fields only after the row has been split into fields by csv.reader.
import csv

class UnicodeCsvReader(object):
    def __init__(self, f, encoding="utf-8", **kwargs):
        self.csv_reader = csv.reader(f, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next()
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding="utf-8", fieldnames=None, **kwds):
        csv.DictReader.__init__(self, f, fieldnames=fieldnames, **kwds)
        self.reader = UnicodeCsvReader(f, encoding=encoding, **kwds)
Usage (the source file is encoded in utf-8):
csv_lines = (
    "абв,123",
    "где,456",
)

for row in UnicodeCsvReader(csv_lines):
    for col in row:
        print type(col), col
Output:
$ python test.py
<type 'unicode'> абв
<type 'unicode'> 123
<type 'unicode'> где
<type 'unicode'> 456
A bit late with the answer, but I have had great success with unicodecsv.

The module, available here, looks like a cool, simple drop-in replacement for the csv module that lets you work with utf-8 csv.
import ucsv as csv

with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row
There is already a Unicode usage example in that very documentation, so why do you need to find another one or reinvent the wheel?
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
I can confirm that unicodecsv is a great replacement for the csv module. I just replaced csv with unicodecsv in my source code, and it works like a charm.
The wrapper unicode_csv_reader mentioned in the Python documentation accepts Unicode strings. This is because csv does not accept Unicode strings. csv is probably not aware of encodings or locales and just treats the strings it gets as bytes. So what happens is that the wrapper encodes the Unicode strings, meaning that it creates a string of bytes. Then, when the wrapper gives back the results from csv, it decodes the bytes again, converting the UTF-8 byte sequences back to the correct Unicode characters.

If you give the wrapper a plain byte string, e.g. by using f.readlines(), it will raise a UnicodeDecodeError on bytes with a value > 127. You would use the wrapper when you have Unicode strings in your program that are in the CSV format.

I can imagine that the wrapper still has one limitation: since csv does not accept Unicode, and it also does not accept multi-byte delimiters, you cannot parse files that have a Unicode character as the delimiter.
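Incidentally, this delimiter limitation is one more thing Python 3 lifts: its csv module accepts any single Unicode character as the delimiter. A small illustration (not from the original answer; the snowman delimiter is just an arbitrary non-ASCII example):

```python
import csv
import io

# Python 3's csv module takes any one-character (unicode) string as delimiter
data = io.StringIO("абв☃123\nгде☃456\n")
rows = list(csv.reader(data, delimiter="☃"))
print(rows)  # [['абв', '123'], ['где', '456']]
```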
Maybe this is obvious, but for the sake of beginners I will mention it: in Python 3.x the csv module supports any encoding out of the box, so if you use that version you can stick with the standard module.
import csv

with open("foo.csv", encoding="utf-8", newline="") as f:
    r = csv.reader(f, delimiter=";")
    for row in r:
        print(row)
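Writing works the same way in Python 3. A minimal round-trip sketch (foo.csv is just a placeholder name; newline="" is what the csv docs recommend when opening files for csv use):

```python
import csv

# write unicode text with the standard csv module (Python 3)
with open("foo.csv", "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f, delimiter=";")
    w.writerow(["абв", "123"])
    w.writerow(["где", "456"])

# read it back; every cell comes out as a str (unicode) object
with open("foo.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter=";"))

print(rows)  # [['абв', '123'], ['где', '456']]
```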
You should have a look at tablib, which takes a completely different approach, but should be considered under the "just works" requirement.
with open('some.csv', 'rb') as f:
    csv = f.read().decode("utf-8")

import tablib
ds = tablib.Dataset()
ds.csv = csv
for row in ds.dict:
    print row["First name"]
Warning: tablib will reject your csv if it does not have the same number of items on every row.
Here is a slightly improved version of Maxim's answer, which can also skip the UTF-8 BOM:
import csv
import codecs

class UnicodeCsvReader(object):
    def __init__(self, csv_file, encoding='utf-8', **kwargs):
        if encoding == 'utf-8-sig':
            # convert from utf-8-sig (= UTF-8 with BOM) to plain utf-8 (without BOM):
            self.csv_file = codecs.EncodedFile(csv_file, 'utf-8', 'utf-8-sig')
            encoding = 'utf-8'
        else:
            self.csv_file = csv_file
        self.csv_reader = csv.reader(self.csv_file, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next()
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, csv_file, encoding='utf-8', fieldnames=None, **kwds):
        reader = UnicodeCsvReader(csv_file, encoding=encoding, **kwds)
        csv.DictReader.__init__(self, reader.csv_file, fieldnames=fieldnames, **kwds)
        self.reader = reader
Note that the presence of the BOM is not automatically detected. You must signal that it is there by passing the encoding='utf-8-sig' argument to the constructor of UnicodeCsvReader or UnicodeDictReader. The encoding utf-8-sig is utf-8 with a BOM.
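For comparison, on Python 3 none of this wrapping is needed: passing encoding="utf-8-sig" to open() strips the BOM transparently on read. A small sketch (bom.csv is a made-up file name):

```python
import csv

# create a file with a UTF-8 BOM, as Excel does when exporting "CSV UTF-8"
with open("bom.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["абв", "123"])

# utf-8-sig strips the BOM on read; plain utf-8 would leave '\ufeffабв'
with open("bom.csv", encoding="utf-8-sig", newline="") as f:
    first_row = next(csv.reader(f))

print(first_row)  # ['абв', '123']
```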
I would add to that answer. By default, Excel saves csv files as latin-1 (which ucsv does not support). You can easily fix this with:
import codecs
import StringIO
import ucsv

with codecs.open(csv_path, 'rb', 'latin-1') as f:
    f = StringIO.StringIO(f.read().encode('utf-8'))
    reader = ucsv.UnicodeReader(f)
    # etc.
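On Python 3, the same Excel/latin-1 situation needs no re-encoding dance at all; you just declare the encoding when opening the file. A sketch under that assumption (latin1.csv is a hypothetical name standing in for an Excel export):

```python
import csv

# simulate an Excel-style latin-1 export
with open("latin1.csv", "w", encoding="latin-1", newline="") as f:
    csv.writer(f).writerow(["café", "42"])

# read it back by naming the encoding up front
with open("latin1.csv", encoding="latin-1", newline="") as f:
    row = next(csv.reader(f))

print(row)  # ['café', '42']
```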