python - Python3: Handling UTF-8-incompatible characters in CSV output

Question

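I'm on Python 3.2 and have SQL output that I'm writing to CSV files with a "Name" identifier and "Specifics". For some data from China, people's names (and with them Chinese characters) are being inserted. I've read the unicode/decoding docs as best I can, but I can't work out how to replace or strip these characters inline in my Python code.

I'm processing the file like this: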

import csv, os, os.path
rfile = open('nonbillabletest2.csv','r',newline='')
dataread= csv.reader(rfile)
trash=next(dataread) #ignores the header line in csv

#Process the target CSV by creating an output with a unique filename per CompanyName
for line in dataread:
    [CompanyName,Specifics] = line
    #Check that a target csv does not exist
    if not os.path.exists('test leads '+CompanyName+'.csv'):
        wfile= open('test leads '+CompanyName+'.csv','a')
        datawrite= csv.writer(wfile, lineterminator='\n')
        datawrite.writerow(['CompanyName','Specifics']) #write new header row in each file created
        datawrite.writerow([CompanyName,Specifics])
        wfile.close() #close each output file as soon as it is written
rfile.close()

I get this error:

Traceback (most recent call last):
  File "C:\Users\Matt\Dropbox\nonbillable\nonbillabletest.py", line 26, in <module>
    for line in dataread:
  File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1886: character maps to <undefined>

Inspecting the file contents shows what are clearly some non-UTF-8 characters:

print(repr(open('nonbillabletest2.csv', 'rb').read()))

b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x86\xac\xff; \r\neGENTIC,
\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\neGENTIC,\x86\xac\xff; \r\n'

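Adding encoding='utf8' does not solve the problem. I have been able to remove individual characters with ...replace('\x86\xac\xff', '')), but I would have to do that for every character I might encounter, which is not efficient.

A SQL solution would be fine too. Please help!

Update: As suggested, I removed the characters using string.printable. I then hit one more error because the "contents" section always had an extra final line. Adding an if len=0 check took care of that.

Thanks very much for the quick help!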
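
The length check mentioned in the update might look something like the sketch below. The sample string is only a stand-in for the cleaned file contents, and the variable names are illustrative rather than taken from the real script.

```python
import csv

# Stand-in for the cleaned file contents; it ends with '\r\n' like the real file
contents = 'CompanyName,Specifics\r\neGENTIC,; \r\n'

rows = []
for line in csv.reader(contents.split('\r\n')):
    if len(line) == 0:  # the trailing '' left by split() comes back as an empty row
        continue
    rows.append(line)

print(rows)  # [['CompanyName', 'Specifics'], ['eGENTIC', '; ']]
```

Without the check, unpacking the empty row into two names would raise a ValueError.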

score 1 · Accepted Answer

So nonbillabletest2.csv is not encoded in UTF-8.
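
One way to confirm this is to try decoding the raw bytes directly. The sketch below uses a byte sample copied from the repr() output in the question; the helper name `is_valid_utf8` is made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Sample taken from the repr() output shown in the question
sample = b'eGENTIC,\x86\xac\xff; \r\n'
print(is_valid_utf8(sample))                   # False: 0x86 cannot start a UTF-8 sequence
print(is_valid_utf8('名称'.encode('utf-8')))    # True
```

The same helper could be pointed at the whole file by passing it the result of open('nonbillabletest2.csv', 'rb').read().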

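You can:

Fix it upstream. Make sure the data is properly encoded as UTF-8, as you expect. This may be the "SQL solution" you are referring to.

Strip all non-ASCII characters beforehand (purists will say this corrupts the data, but from what you've said it sounds acceptable to you):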

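import csv, os, string

# Read the file as raw bytes so nothing is decoded yet
with open('nonbillabletest2.csv', 'rb') as rfile:
    rbytes = rfile.read()

# Keep only printable ASCII; string.printable already includes whitespace
contents = ''.join(chr(b) for b in rbytes if chr(b) in string.printable)

# The trailing '\r\n' leaves an empty last element, which csv.reader yields as an empty row
dataread = csv.reader(contents.split('\r\n'))
....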
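
A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: let the decoder drop the offending bytes by combining the ascii codec with errors='ignore'. The function name `read_ascii_rows` is hypothetical, and the demo runs on bytes copied from the question instead of the real file. Like the loop above, this silently discards everything that is not ASCII, so the Chinese names are lost.

```python
import csv
import io

def read_ascii_rows(stream):
    """Yield non-empty CSV rows from a binary stream, dropping non-ASCII bytes."""
    # errors='ignore' makes the ascii codec skip any byte it cannot decode
    text = io.TextIOWrapper(stream, encoding='ascii', errors='ignore', newline='')
    for row in csv.reader(text):
        if row:  # skip blank lines
            yield row

# Demo on bytes copied from the question instead of the real file
raw = b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\n'
rows = list(read_ascii_rows(io.BytesIO(raw)))
print(rows)
# [['CompanyName', 'Specifics'], ['eGENTIC', '; '], ['eGENTIC', '?; ']]
```

For the actual file, the same effect should be available by opening it directly in text mode with open('nonbillabletest2.csv', encoding='ascii', errors='ignore', newline='') and passing that to csv.reader.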
