python - Python encoding UTF-8

Question

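I am writing some scripts in Python. I build a string that I save to a file. This string holds a lot of data, taken from the directory tree and the filenames. According to convmv, my whole tree is in UTF-8.

I want to keep everything in UTF-8 because I will store it in MySQL afterwards. For now, in MySQL, which is in UTF-8, I have trouble with some characters (such as é or è; I am French).

I want Python to always treat strings as UTF-8. I read some information on the internet and did it like this.

My script begins with this: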

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 def createIndex():
     import codecs
     toUtf8=codecs.getencoder('UTF8')
     #lot of operations & building indexSTR the string who matter
     findex=open('config/index/music_vibration_'+date+'.index','a')
     findex.write(codecs.BOM_UTF8)
     findex.write(toUtf8(indexSTR)) #this bugs!

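When I run it, this is the response: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I see that the accents are written correctly in my file. After creating this file, I read it and write it into MySQL. But I don't understand why I get encoding problems there. My MySQL database is in utf8, or so it seems: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this: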

#!/usr/bin/python
# -*- coding: utf-8 -*-

def saveIndex(index,date):
    import MySQLdb as mdb
    import codecs

    sql = mdb.connect('localhost','admin','*******','music_vibration')
    sql.charset="utf8"
    findex=open('config/index/'+index,'r')
    lines=findex.readlines()
    for line in lines:
        if line.find('#artiste') != -1:
            artiste=line.split('[:::]')
            artiste=artiste[1].replace('\n','')

            c=sql.cursor()
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
            nbr=c.fetchone()
            if nbr[0]==0:
                c=sql.cursor()
                iArt+=1
                c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8')

The artist names display correctly in the file but are written incorrectly to the database. What is the problem?

score 62 · Accepted Answer

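You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode (implicitly, using the default ASCII codec) before it can encode it back to UTF-8. That implicit decode is what fails here: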

>>> data = u'\u00c3'            # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8')         # Try to *re*-encode it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file; there is no need to encode data that is already encoded.
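The hidden step can be made explicit. The following is a minimal sketch, not the asker's code: it spells out the ASCII decode that Python 2 performs implicitly when `.encode()` is called on a byte string, and it is written so that it also runs on Python 3. The variable names are made up for illustration.

```python
# Byte string that is already UTF-8 encoded: 0xC3 0x83.
encoded = u'\u00c3'.encode('utf8')

try:
    # Python 2's str.encode('utf8') does this ASCII decode behind the scenes
    # before it can encode; written out here so the failure is visible.
    encoded.decode('ascii').encode('utf8')
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Decoding with the codec that actually produced the bytes works fine.
round_tripped = encoded.decode('utf8')
```

The bytes were valid UTF-8 all along; only the implicit ASCII decode rejects them.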

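If you build up unicode values instead, you do have to encode them before they can be written to a file. In that case use codecs.open(), which returns a file object that encodes unicode values to UTF-8 for you.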
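As a rough illustration of that approach, the sketch below writes a unicode value through `codecs.open()` and then reads the raw bytes back. The temporary file and the sample text are assumptions made for the example; they stand in for the asker's index file.

```python
import codecs
import os
import tempfile

text = u'caf\u00e9'  # unicode value containing an e-acute

# Hypothetical scratch file, used only for this illustration.
fd, path = tempfile.mkstemp(suffix='.index')
os.close(fd)

try:
    # The wrapper returned by codecs.open() encodes unicode on write.
    with codecs.open(path, 'w', 'utf8') as findex:
        findex.write(text)

    # Read the raw bytes back to see what ended up on disk.
    with open(path, 'rb') as raw:
        stored = raw.read()
finally:
    os.remove(path)
```

No explicit `.encode()` call appears anywhere; the file object takes care of it.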

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
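To see why a stray BOM is a nuisance, here is a small sketch (sample data invented for the example). It shows that a plain UTF-8 decode keeps the BOM as a leading U+FEFF character, while the standard `utf-8-sig` codec strips it when present.

```python
import codecs

payload = u'\u00e9'.encode('utf8')
with_bom = codecs.BOM_UTF8 + payload     # what the question's script writes first

plain = with_bom.decode('utf8')          # BOM survives as a leading U+FEFF
stripped = with_bom.decode('utf-8-sig')  # BOM removed if it is there
no_bom = payload.decode('utf-8-sig')     # harmless when there is no BOM
```

Any reader of the file that does not expect a BOM would end up with that extra U+FEFF in front of the first value.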

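For your MySQL insert problem, you need to do two things:

Add charset='utf8' to your MySQLdb.connect() call.

Use unicode objects, not str objects, when querying or inserting, and use SQL parameters so the MySQL connector can do the right thing for you: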

artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

# ...

c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
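The effect of SQL parameters can be tried without a MySQL server. The sketch below is an assumption-laden stand-in: it uses the standard-library `sqlite3` module rather than MySQLdb, so the placeholder is `?` instead of `%s`, and the table definition and artist name are invented for the example. The point is only that the driver handles quoting and text encoding once values are passed separately from the SQL string.

```python
import sqlite3

# Stand-in database so the sketch runs anywhere; sqlite3 uses "?" where
# MySQLdb uses "%s". The schema below is a guess, not the asker's table.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE artistes (id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)'
)

artiste = u'Les "\u00c9toiles"'  # hypothetical name with quotes and an accent

cursor = conn.cursor()
cursor.execute('SELECT COUNT(id) FROM artistes WHERE nom=?', (artiste,))
if not cursor.fetchone()[0]:
    # Values travel as parameters, so embedded quotes cannot break the query.
    cursor.execute(
        'INSERT INTO artistes(nom, status, path) VALUES(?, 99, ?)',
        (artiste, artiste + u'/'),
    )
conn.commit()

cursor.execute('SELECT nom, path FROM artistes')
rows = cursor.fetchall()
conn.close()
```

With string concatenation, as in the question, the double quotes in this name would have terminated the SQL literal early.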

It may actually work better if you use codecs.open() to decode the contents automatically instead:

import codecs
import MySQLdb as mdb

sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
artists_inserted = 0

with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue

        artiste=line.split(u'[:::]')[1].strip()

        # the lookup and insert must run once per artist line, inside the loop
        cursor = sql.cursor()
        cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
        if not cursor.fetchone()[0]:
            cursor = sql.cursor()
            cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
            artists_inserted += 1
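The line-parsing part of that loop can be exercised on its own. This is a sketch under assumptions: the helper name `extract_artists` is hypothetical, and the line layout `#artiste[:::]Name` is inferred from the question's code rather than from any documented format.

```python
def extract_artists(lines):
    """Return artist names from lines shaped like '#artiste[:::]Name'."""
    artists = []
    for line in lines:
        if u'#artiste' not in line:
            continue
        parts = line.split(u'[:::]')
        if len(parts) < 2:
            continue  # marker present but no separator; skip the line
        artists.append(parts[1].strip())
    return artists


# Invented sample lines, already decoded to unicode as codecs.open() would do.
sample = [
    u'#artiste[:::]Ren\u00e9e\n',
    u'#album[:::]Premi\u00e8re\n',
    u'#artiste[:::]H\u00e9l\u00e8ne\n',
]
names = extract_artists(sample)
```

Because the lines are unicode from the start, the accented names pass through `split()` and `strip()` untouched and can be handed to the database cursor as parameters.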

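You may want to brush up on Unicode, UTF-8 and encodings. I can recommend the following articles:

The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky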

score 3 · Accepted Answer

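Unfortunately, the string.encode() method is not always reliable. Check out this thread for more information: What is the fool proof way to convert some string (utf-8 or else) to a simple ASCII string in python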
