python - utf-16-le BOM csv files

Question

I'm downloading some CSV files from playstore (stats etc) and want to process with python.

cromestant@jumphost-vpc:~/stat_dev/bime$ file -bi stats/installs/*
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le

As you can see they are utf-16le.

I have some code on python 2.7 that works on some files and not on others:

import codecs
.
.
fp =codecs.open(dir_n+'/'+file_n,'r',"utf-16")
 for line in fp:
  #write to mysql db

This works until:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128)

What is the proper way to do this? I've seen "re encode" use cvs module etc. but csv module does not handle encoding by itself, so it seems overkill for just dumping to a database

score 4 · Accepted Answer

你试过codecs.EncodedFile吗？

with open('x.csv', 'rb') as f:
    g = codecs.EncodedFile(f, 'utf8', 'utf-16le', 'ignore')
    c = csv.reader(g)
    for row in c:
        print row
        # and if you want to use unicode instead of str:
        row = [unicode(cell, 'utf8') for cell in row]

score 3 · Accepted Answer

这样做的正确方法是什么？

正确的方法是使用 Python3，其中 Unicode 支持要合理得多。

作为一种解决方法，如果您出于某种原因对 Python3 过敏，最好的折衷方案是 wrap csv.reader()，如下所示：

import codecs
import csv

def to_utf8(fp):
    for line in fp:
        yield line.encode("utf-8")

def from_utf8(fp):
    for line in fp:
        yield [column.decode('utf-8') for column in line]

with codecs.open('utf16le.csv','r', 'utf-16le') as fp:
    reader = from_utf8(csv.reader(to_utf8(fp)))
    for line in reader:
        #"line" is a list of unicode strings
        #write to mysql db
        print line

python - utf-16-le BOM csv files

2 回答 2

Related

Reference