python - Why do I get an extra escape character when I insert unicode characters into a sqlite3 database with Python 2.7?

Question

I query an API and get back a json blob with the following value:

{
    ...
    "Attribute" : "Some W\u00e9irdness", 
    ...
}

(The correct value, of course, is "Some Wéirdness".)

I add that value, along with some other things, to a list of fields that I want to insert into my sqlite3 database. The list looks like this:

[None, 203, None, None, True, u'W\xe9irdness', None, u'Some', None, None, u'Some W\xe9irdness', None, u'Some W\xe9irdness', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

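I note that we've already gone through a switch from \x00e9 to \xe9, and I'm not yet sure why that is, but I was hoping that would be fine... it's just a different unicode encoding.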

Before trying to insert into the sqlite table, I "stringatize" the list (see the function below) and make it a tuple:

('', '203', '', '', 'True', 'W\xe9irdness', '', 'Some', '', '', 'Some W\xe9irdness', '', 'Some W\xe9irdness', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')

Then I do the insert:

my_tuple = tuple(val for val in my_utils.stringatize(my_list))

sql = "INSERT OR REPLACE INTO roster VALUES %s" % repr(my_tuple)

cur.execute(sql)

When I later retrieve it with a SELECT statement, the value has an extra escape (backslash) character added:

u'Some W\\xe9irdness'

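First, I already know that I shouldn't be using string interpolation with sqlite. However, I couldn't figure out how to use ? placeholders when the number of fields per record might change over time. (If you know a better way to do this, I'm all ears, but that's probably another post.)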

To troubleshoot, I printed the formatted insert sql statement, and I only see a single backslash:

INSERT OR REPLACE INTO roster VALUES ('', '203', '', '', 'True', 'W\xe9irdness', '', 'Some', '', '', 'Some W\xe9irdness', '', 'Some W\xe9irdness', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')

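This looks the same as it did in my list above, so I'm puzzled. Perhaps this is being interpreted as a string with a backslash that must be escaped, and the xe9 is just being treated as ASCII text. Here is the stringatize function that I use to prepare the list for insertion: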

import datetime

def stringatize(cell_list, encoding='raw_unicode_escape', delete_quotes=False):
    """
    Converts every 'cell' in a 'row' (generally something extracted from
    a spreadsheet) to a unicode, then returns the list of cells (with all
    strings now, of course).
    """

    stringatized_list = []

    for cell in cell_list:
        # datetime.datetime is a subclass of datetime.date, so it is tested first.
        if isinstance(cell, datetime.datetime):
            new = cell.strftime("%Y-%m-%dT%H:%M:%S")
        elif isinstance(cell, datetime.date):
            new = cell.strftime("%Y-%m-%d")
        elif isinstance(cell, datetime.time):
            new = cell.strftime("%H:%M:%S")
        elif isinstance(cell, (int, long)):  # Python 2; also matches bool
            new = str(cell)
        elif isinstance(cell, float):
            new = "%.2f" % cell
        elif cell is None:
            new = ""
        else:
            new = cell

        if delete_quotes:
            new = new.replace("\"", "")

        # Encodes every cell to a byte string using the given codec.
        my_unicode = new.encode(encoding)
        stringatized_list.append(my_unicode)

    return stringatized_list

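I appreciate any thoughts you have on this. The goal is to eventually dump this value into an Excel sheet, which can handle Unicode and so should display the value correctly.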

Edit: In response to @CL's question, I tried removing the "encode" line from my stringatize function.

It now ends as follows:

    #my_unicode = new.encode(encoding)
    my_unicode = new

    stringatized_list.append(my_unicode)

return stringatized_list

The new sql looks like this (below it is the traceback I get when I try to execute it):

INSERT OR REPLACE INTO roster VALUES ('', u'203', u'', u'', 'True', u'W\xe9irdness', '', u'Some', '', '', u'Some W\xe9irdness', '', u'Some W\xe9irdness', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')

Traceback (most recent call last):
  File "test.py", line 80, in <module>
    my_call
  File redacted.py, line 102, in my_function
    cur.execute(sql)
sqlite3.OperationalError: near "'203'": syntax error

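I do mean for that number to be converted to a string. I suspect the error has to do with the repr(my_tuple) I'm doing, and with the u'' prefix no longer actually signifying unicode once it is inside the SQL text.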

score 2 · Accepted Answer

"Some W\u00e9irdness"
"Some Wéirdness"

are equally valid JSON string literal forms with exactly the same value, Some Wéirdness.

u'W\xe9irdness'

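"I note that we've already gone through a switch from \x00e9 to \xe9, and I'm not yet sure why that is, but I was hoping that would be fine... it's just a different unicode encoding."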

There is no switch, and no encoding; the string is still Some Wéirdness.

You have just printed the string from Python, and Python string literals have a \xNN form that JSON doesn't have. It means the same thing as \u00NN.
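As a quick check of that point, the sketch below parses a small JSON document modelled on the one in the question (reduced to a single key for illustration) and compares the result with the same text written using Python's \xNN escape. It is written to run on Python 3 as well as 2.7, which is an assumption on my part since the thread targets 2.7.

```python
import json

# JSON text as it might arrive from the API (illustrative, one key only).
# The doubled backslash keeps \u00e9 as a JSON escape inside the text.
json_text = '{"Attribute": "Some W\\u00e9irdness"}'
value = json.loads(json_text)["Attribute"]

# The same text written with the Python-only \xNN escape.
same_value = u"Some W\xe9irdness"

# Both spellings name one code point, U+00E9, at index 6.
print(value == same_value)
print(hex(ord(value[6])))
```

Neither escape survives parsing; each is only a way of spelling the character in source text.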

my_tuple = tuple(val for val in my_utils.stringatize(my_list))
sql = "INSERT OR REPLACE INTO roster VALUES %s" % repr(my_tuple)
cur.execute(sql)

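Don't do this. The Python tuple literal generated by repr is not at all the same format as a SQL value list. In particular, SQL string literals have no concept of backslash escapes, so the \xE9 that means an é in a Python Unicode string literal means, in SQL, just a backslash, the letter x, the letter E and the digit 9.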
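A minimal sketch of that mismatch, assuming an in-memory database and a hypothetical two-column demo table. The SQL strings are typed out by hand to match what repr produced in the question, so the example does not depend on Python 2's repr output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# In Python source "\\x" is one backslash followed by x, so this SQL text
# contains the literal characters \xe9, like the statement printed in the question.
sql = "SELECT 'W\\xe9irdness'"
stored = conn.execute(sql).fetchone()[0]

# SQLite keeps the backslash as an ordinary character.
print(len(stored))                  # 12 characters: W \ x e 9 i r d n e s s
print(stored == u"W\xe9irdness")    # False, no e-acute was stored

# The u'' prefix that repr() emits for unicode strings is not SQL either.
conn.execute("CREATE TABLE demo (a, b)")   # hypothetical table
try:
    conn.execute("INSERT INTO demo VALUES ('', u'203')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True
print(rejected)
```

This covers both symptoms from the question: the doubled backslash seen on SELECT, and the syntax error after the encode step was removed.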

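While there are proper ways to encode a string so that it fits in a SQL string literal, you should avoid doing so: getting it right is not trivial, and getting it wrong is a security problem. Instead, forget about "stringatizing" and just pass the raw values to the database as parameters: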

cur.execute(
    'INSERT OR REPLACE INTO roster VALUES (?, ?, ?, ?, ....)',
    my_list
)
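For the case raised in the question, where the number of fields may change over time, one option is to build the placeholder list from the length of the row. The sketch below assumes a hypothetical three-column roster table and an in-memory database; only the "?" marks are interpolated into the SQL text, never the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical narrow table; the real roster table has many more columns.
conn.execute(
    "CREATE TABLE roster (id INTEGER PRIMARY KEY, first_name TEXT, full_name TEXT)"
)

my_list = [203, u"Some", u"Some W\xe9irdness"]

# One "?" per value, so the statement follows the row length.
placeholders = ", ".join("?" for _ in my_list)
sql = "INSERT OR REPLACE INTO roster VALUES (%s)" % placeholders
conn.execute(sql, my_list)

row = conn.execute("SELECT full_name FROM roster WHERE id = ?", (203,)).fetchone()
print(row[0] == u"Some W\xe9irdness")
```

The values keep their Python types on the way in (None is stored as NULL), and the text comes back without an added backslash.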
