python - SQLite, python, unicode, and non-utf data

Question

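I started by trying to store strings in sqlite using python, and got the message:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

OK, I switched to Unicode strings. Then I started getting the message:

sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'

when trying to retrieve data from the db. More research, and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur RÃ³s'

Note: My console was set to display in 'latin_1', as @John Machin pointed out.

What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.

I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple of hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite "highly recommend" I switch my application to unicode strings?

I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way, please tell me and I'll update, or one of you senior guys can update.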

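Summary of answers

Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base, and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as in, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python, the process of getting to unicode and into another encoding looks like: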

str.decode('source_encoding').encode('desired_encoding')

or, if str is already in unicode:

str.encode('desired_encoding')
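For anyone reading this on Python 3, here is a minimal sketch of the same two-step idea. It assumes Python 3, where `bytes` plays the role of the Python 2 `str` and `str` plays the role of `unicode`; the helper name `transcode` is made up for illustration.

```python
def transcode(data: bytes, source_encoding: str, desired_encoding: str) -> bytes:
    """Decode bytes to text with the source encoding, then re-encode the text."""
    text = data.decode(source_encoding)    # bytes -> Unicode text
    return text.encode(desired_encoding)   # Unicode text -> bytes


latin1_bytes = b"Sigur R\xf3s"             # in latin_1, 'ó' is the single byte 0xF3
utf8_bytes = transcode(latin1_bytes, "latin_1", "utf_8")
print(repr(utf8_bytes))                    # b'Sigur R\xc3\xb3s'
```

The round trip only works when `source_encoding` really is the encoding the bytes were written in; guessing wrong gives either an exception or silently wrong text.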

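For sqlite I didn't actually want to encode it again; I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python.

1. The encoding of the string you want to work with, and the encoding you want to get it to.
2. The system encoding.
3. The console encoding.
4. The encoding of the source file.

Elaboration:

(1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on.

(2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary... well, here's the file I added: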

# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')

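This system encoding is the one that gets used when you call the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding.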
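For comparison, a small sketch of the same pitfall on Python 3 (an assumption; everything above is Python 2). There the default encoding is always UTF-8 and there is no `sys.setdefaultencoding`, so decoding without naming an encoding only works for UTF-8 data:

```python
import sys

raw = b"\xf3"  # 'ó' encoded as latin_1

default_encoding = sys.getdefaultencoding()  # 'utf-8' on Python 3

try:
    raw.decode()                  # no argument given: UTF-8 is used
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False           # 0xF3 alone is not valid UTF-8

explicit = raw.decode("latin_1")  # naming the real source encoding works

print(default_encoding, implicit_ok, ascii(explicit))
```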

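(3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm that what I was doing was working.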
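One way to check what the console claims to support, sketched for Python 3 (an assumption). The helper name `printable` is hypothetical; it replaces anything the given console encoding cannot represent with '?', so the print never raises:

```python
import sys


def printable(text: str, console_encoding: str) -> str:
    """Return text with characters the console cannot encode replaced by '?'."""
    return text.encode(console_encoding, errors="replace").decode(console_encoding)


# The encoding attribute can be missing or None when output is redirected.
console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
print(console_encoding)
print(printable("Sigur R\xf3s", console_encoding))
```

Note that this only reports what Python believes about the console; a terminal that is actually set to a different encoding will still show mangled characters.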

(4) If you want to have some test strings, e.g.

test_str = "ó"

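in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like this at the top of your source code file: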

# -*- coding: utf_8 -*-

If you don't have this line, python attempts to parse your code as ascii by default, and so:

SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
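If you want to see which encoding python will pick for a source file, the standard library can tell you. This is a sketch for Python 3 (an assumption), where the default without a declaration is UTF-8 rather than ascii; the helper name `declared_encoding` is made up for illustration.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding Python 3 would use to read this source code."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


with_cookie = b"# -*- coding: latin_1 -*-\ntest_str = '\xf3'\n"
without_cookie = b"test_str = 'plain ascii'\n"

print(declared_encoding(with_cookie))
print(declared_encoding(without_cookie))
```

`tokenize.detect_encoding` normalizes the name it finds, so the value returned may be spelled differently from what is written in the file.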

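Once your program is working correctly, or if you aren't using python's console or any other console to look at output, then you will probably only really care about #1 on the list. The system default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function, pasted at the bottom of this gigantic mess, that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: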

'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

Now I change the system and console encoding to latin_1, and I get this output for the same input:

'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

Notice that the 'original' character displays correctly and the builtin unicode() function works now.

Now I change my console output back to utf_8.

'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

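Here everything still works the same as last time, but the console can't display the output correctly. Etc. The function below also displays more information than this, and hopefully will help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this will be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great, but sometimes source code can save you a day or two of trying to figure out what functions do what.

Disclaimers: I'm no encoding expert; I put this together to help my own understanding. I kept building on it when I should probably have started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes; they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.

One more thing: there are apparently crazy application developers making life difficult in Windows.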

#!/usr/bin/env python
# -*- coding: utf_8 -*-

import os
import sys

def encodingDemo(str):
    validStrings = ()
    try:        
        print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
        validStrings += ((str,""),)
    except UnicodeEncodeError as ude:
        print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print ude
    try:
        x = unicode(str)
        print "unicode(str) = ",x
        validStrings+= ((x, " decoded into unicode by the default system encoding"),)
    except UnicodeDecodeError as ude:
        print "ERROR.  unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
        print "\tThe system encoding is set to {0}.  See error:\n\t".format(sys.getdefaultencoding()),  
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print uee
    try:
        x = str.decode('latin_1')
        print "str.decode('latin_1') =",x
        validStrings+= ((x, " decoded with latin_1 into unicode"),)
        try:        
            print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
            validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
        except UnicodeDecodeError as ude:
            print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8.  See error:\n\t",
            print ude
    except UnicodeDecodeError as ude:
        print "Something didn't work, probably because the string wasn't latin_1 encoded.  See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print uee
    try:
        x = str.decode('utf_8')
        print "str.decode('utf_8') =",x
        validStrings+= ((x, " decoded with utf_8 into unicode"),)
        try:        
            print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
            validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
        except UnicodeEncodeError as uee:
            print "str.decode('utf_8').encode('latin_1') didn't work.  The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1.  See error:\n\t",
            print uee
    except UnicodeDecodeError as ude:
        print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded.  See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",uee

    print
    print "Printing information about each character in the original string."
    for char in str:
        try:
            print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
        except UnicodeDecodeError as ude:
            print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
            print uee    

        try:
            x = unicode(char)        
            print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = unicode(char) ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = unicode(char)  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)

        try:
            x = char.decode('latin_1')
            print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('latin_1')  ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('latin_1')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)

        try:
            x = char.decode('utf_8')
            print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('utf_8')  ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('utf_8')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)

        print

x = 'ó'
encodingDemo(x)
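The core of the demo above, trying several decodings on the same input and recording which ones fail, can be written much more compactly. This is only a sketch for Python 3 (an assumption; the demo itself is Python 2), and the name `try_decodings` is made up for illustration.

```python
def try_decodings(data: bytes, encodings=("ascii", "latin_1", "utf_8")) -> dict:
    """Map each encoding name to the decoded text, or to the error it raised."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError as error:
            results[name] = error
    return results


for name, outcome in try_decodings(b"\xf3").items():
    # ascii() escapes non-ASCII characters, so this prints on any console.
    print(name, "->", ascii(outcome))
```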

Thanks a lot for the answers below, and especially to @John Machin for answering so thoroughly.

score 35 · Accepted Answer

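I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it

repr() and unicodedata.name() are your friends when it comes to debugging such problems: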

>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>

If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
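That A-tilde/superscript-3 effect can be reproduced without a terminal by decoding the UTF-8 bytes as latin1, which is effectively what a latin1 terminal does. A minimal sketch, assuming Python 3:

```python
import unicodedata

oacute_utf8 = "\xf3".encode("utf8")                      # b'\xc3\xb3'
seen_on_latin1_terminal = oacute_utf8.decode("latin1")   # two characters, not one

names = [unicodedata.name(ch) for ch in seen_on_latin1_terminal]
print(ascii(seen_on_latin1_terminal), names)
```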

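I switched to Unicode strings.

What are you calling Unicode strings? UTF-16?

What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.

I can't imagine how it seems so to you. The story being conveyed was that unicode objects in Python and UTF-8 encoding in the database were the way to go. However, Martin answered the original question, giving a method ("text factory") for the OP to be able to use latin1. This did NOT constitute a recommendation!

Update in response to these further questions raised in a comment:

I didn't understand that the unicode characters still contained an implicit encoding. Am I saying that right?

No. An encoding is a mapping between Unicode and something else, and vice versa. So a Unicode character doesn't have an encoding, implicit or otherwise.

It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr().

Say what? It doesn't look like it to me: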

>>> unicode("\xF3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal
not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>

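Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') ... this is certainly true.

It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) ... including blowing up when an inappropriate encoding is supplied.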
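The Python 3 counterpart of that equivalence is `str(bytes_object, encoding)` versus `bytes_object.decode(encoding)`. A quick sketch, assuming Python 3; the helper `raises_decode_error` is made up for illustration:

```python
raw = b"\xf3"

# Both spellings give the same text when the encoding fits...
same_result = str(raw, "latin1") == raw.decode("latin1")


def raises_decode_error(func) -> bool:
    """Return True if calling func() raises UnicodeDecodeError."""
    try:
        func()
    except UnicodeDecodeError:
        return True
    return False


# ...and both blow up when it does not.
both_fail = (raises_decode_error(lambda: str(raw, "ascii"))
             and raises_decode_error(lambda: raw.decode("ascii")))

print(same_result, both_fail)
```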

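Is that a happy circumstance

That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, it means that ANY 8-bit byte, ANY Python str object, can be decoded into unicode without an exception being raised. This is as it should be.

However, there exist certain persons who confuse two quite separate concepts: "my script runs to completion without any exceptions being raised" and "my script is error-free". To them, latin1 is "a snare and a delusion".

In other words, if you have a file that's actually encoded in cp1252 or gbk or koi8-u or whatever, and you decode it using latin1, the resulting Unicode will be utter rubbish, and Python (or any other language) will not flag an error; it has no way of knowing that you have committed a silliness.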
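To make the "no exception, but rubbish" point concrete, here is a small sketch (Python 3 assumed) using cp1252 curly quotes as the sample data:

```python
# Curly quotes are bytes 0x93 and 0x94 in cp1252.
cp1252_bytes = "\u201cquoted\u201d".encode("cp1252")

wrong = cp1252_bytes.decode("latin1")   # no exception: the quotes become C1 control characters
right = cp1252_bytes.decode("cp1252")   # the intended text

print(ascii(wrong))
print(ascii(right))
```

Nothing in the first decode signals a problem; the only clue is that the result contains characters in the U+0080 to U+009F range.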

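or is unicode("str") going to always return the correct decoding?

Just like that, with the default encoding being ascii, it will return the correct unicode if the file is actually encoded in ASCII. Otherwise, it'll blow up.

Similarly, if you specify the correct encoding, or one that's a superset of the correct encoding, you'll get the correct result. Otherwise you'll get gibberish or an exception.

In short: the answer is no.

If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?

If the str object is a valid XML document, it will be specified up front. The default is UTF-8. If it's a properly constructed web page, it should be specified up front (look for "charset"). Unfortunately, many writers of web pages lie through their teeth (ISO-8859-1 aka latin1 should be Windows-1252 aka cp1252; don't waste resources trying to decode gb2312, use gbk instead). You can get clues from the nationality/language of the website.

UTF-8 is always worth trying. If the data is ascii, it'll work fine, because ascii is a subset of utf8. A string of text that has been written using non-ascii characters and has been encoded in an encoding other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.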
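The "try UTF-8 first" heuristic can be sketched as follows (Python 3 assumed). The function name and the choice of cp1252 as the fallback are assumptions for illustration; pick the fallback from what you know about where the data came from.

```python
def decode_utf8_first(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, otherwise use the fallback."""
    try:
        return data.decode("utf_8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, hence errors="replace".
        return data.decode(fallback, errors="replace")


print(ascii(decode_utf8_first(b"Sigur R\xc3\xb3s")))  # valid UTF-8
print(ascii(decode_utf8_first(b"Sigur R\xf3s")))      # not UTF-8, falls back
```

The fallback branch is still a guess: it cannot fail loudly, so it carries the same "snare and delusion" risk described above.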

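All of the above heuristics, and more, and a lot of statistics, are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However, you can't make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence, e.g. 0.8. Always check the confidence part of the answer.

If all else fails:

(1) Try asking here, with a small sample from the front of your data ... print repr(your_data[:400]) ... and whatever collateral info about its provenance you have.

(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.

Update 2: BTW, isn't it about time you opened up another question ?-)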

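One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren't the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot.

It's not Windows that's doing it; it's a bunch of crazy application developers. It would have been more understandable if you had not paraphrased but quoted the opening paragraph of the effbot article that you referred to:

Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications.

Background:

The range U+0000 to U+001F inclusive is designated in Unicode as "C0 Control Characters". These also exist in ASCII and latin1, with the same meanings. They include such familiar things as carriage return, line feed, bell, backspace, tab, and others that are rarely used.

The range U+0080 to U+009F inclusive is designated in Unicode as "C1 Control Characters". These also exist in latin1, and include 32 characters that nobody outside unicode.org can imagine any possible use for.

Consequently, if you run a character frequency count on your unicode or latin1 data, and you find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, and thus the effbot's solution will work. In another case that I've been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 and another encoding, which needed to be deduced based on letter frequencies in the (human) language the files were written in.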
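The frequency count for the C1 range is short to write. The sketch below (Python 3 assumed, helper names made up for illustration) also shows the cp1252 reinterpretation for the one case where that is what went wrong; it is not the effbot's code, just the same idea.

```python
from collections import Counter


def c1_control_counts(text: str) -> Counter:
    """Count characters in the C1 control range U+0080..U+009F."""
    return Counter(ch for ch in text if "\x80" <= ch <= "\x9f")


def reinterpret_as_cp1252(text: str) -> str:
    """Assumes the text was really cp1252 but got decoded as latin1.

    Raises UnicodeEncodeError if text has characters above U+00FF, and
    UnicodeDecodeError for the few byte values cp1252 leaves undefined.
    """
    return text.encode("latin1").decode("cp1252")


suspect = "\x93quoted\x94"
print(c1_control_counts(suspect))
print(ascii(reinterpret_as_cp1252(suspect)))
```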

score 21 · Accepted Answer

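UTF-8 is the default encoding of SQLite databases. This shows up in situations like "SELECT CAST(x'52C3B373' AS TEXT);". However, the SQLite C library doesn't actually check whether a string inserted into a DB is valid UTF-8.

If you insert a Python unicode object (or a str object in 3.x), the Python sqlite3 library will automatically convert it to UTF-8. But if you insert a str object, it will just assume the string is UTF-8, because a Python 2.x "str" doesn't know its encoding. This is one reason to prefer Unicode strings.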
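Both points can be checked against an in-memory database. A minimal sketch, assuming Python 3's sqlite3 module; the table name `tags` is made up, and `tag_artist` is borrowed from the error message in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The hex literal holds UTF-8 bytes; SQLite interprets them as text.
cast_result = conn.execute("SELECT CAST(x'52C3B373' AS TEXT)").fetchone()[0]

# A Python 3 str is converted to UTF-8 on the way in.
conn.execute("CREATE TABLE tags (tag_artist TEXT)")
conn.execute("INSERT INTO tags VALUES (?)", ("Sigur R\xf3s",))
stored_bytes = conn.execute(
    "SELECT CAST(tag_artist AS BLOB) FROM tags").fetchone()[0]

print(ascii(cast_result), stored_bytes)
conn.close()
```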

However, it doesn't help you if your data is broken to begin with.

To fix your data, do

db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")

for every text column in your database.
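On Python 3 the same repair looks slightly different, because a BLOB argument arrives in the user function as `bytes`. This is a sketch under that assumption; the row is deliberately corrupted first by storing latin-1 bytes in a TEXT column, and the NULL guard is an addition of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TheTable (TextColumn TEXT)")
# Simulate the damage: latin-1 bytes stored as TEXT without any conversion.
conn.execute("INSERT INTO TheTable VALUES (CAST(? AS TEXT))", (b"Sigur R\xf3s",))

# Decode the raw bytes as latin-1; sqlite3 stores the returned str as UTF-8.
conn.create_function(
    "FIXENCODING", 1,
    lambda blob: None if blob is None else blob.decode("latin-1"))
conn.execute(
    "UPDATE TheTable SET TextColumn = FIXENCODING(CAST(TextColumn AS BLOB))")

fixed = conn.execute("SELECT TextColumn FROM TheTable").fetchone()[0]
print(ascii(fixed))
conn.close()
```

Run it only once per column, and only on rows that are actually latin-1: applying it to rows that are already valid UTF-8 would mangle them.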

score 19 · Accepted Answer

I fixed this pysqlite problem by setting:

conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')

By default, text_factory is set to unicode(), which will use the current default encoding (ascii on my machine).
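A rough Python 3 equivalent of that setting, as a sketch: there, `text_factory` is called with `bytes`, so the lambda decodes them itself. Note what 'ignore' does to the data; the table and column names are assumptions taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag_artist TEXT)")
# A row holding latin-1 bytes, i.e. text that is not valid UTF-8.
conn.execute("INSERT INTO tags VALUES (CAST(? AS TEXT))", (b"Sigur R\xf3s",))

# Undecodable bytes are silently dropped instead of raising an error.
conn.text_factory = lambda raw: raw.decode("utf-8", "ignore")
lossy = conn.execute("SELECT tag_artist FROM tags").fetchone()[0]

print(lossy)
conn.close()
```

This makes the error go away, but the 'ó' is lost rather than recovered, so it hides the corruption instead of fixing it.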

score 8 · Accepted Answer

Of course there is. But your data is already broken in the database, so you'll need to fix it:

>>> print u'Sigur RÃ³s'.encode('latin-1').decode('utf-8')
Sigur Rós
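The same one-liner works on Python 3 text. Wrapped in a function (the name `repair_mojibake` is made up for illustration), it can leave strings alone when the round trip does not apply:

```python
def repair_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as latin-1'; return text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = repair_mojibake("Sigur R\xc3\xb3s")   # the mangled 'Sigur RÃ³s'
print(ascii(repaired))
```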

score 4 · Accepted Answer

This fixed my unicode problems with Python 2.x (Python 2.7.6, to be specific):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

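It also solved the error you mention right at the beginning of the post:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless ...

EDIT

sys.setdefaultencoding is a dirty hack. Yes, it can solve UTF-8 issues, but everything comes with a price. For more details, refer to the following links: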
