python - Downsides of using encode('utf-8') on strings read from Excel in Python

Question

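I am reading a large amount of data from an Excel spreadsheet. I read from the spreadsheet (and reformat and rewrite the data) using the following general structure: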

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
     z = i + 1
     toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
     out.write(toprint)
     out.write("\n")

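Here x and y are arbitrary cells, with x being less arbitrary and containing UTF-8 characters.

So far I have only used .encode('utf-8') on cells where I know there would otherwise be an error, or where I foresee an error without it.

My question is basically this: is there a downside to using .encode('utf-8') on every cell, even when it is unnecessary? Efficiency is not an issue. The main concern is that the script keeps working even if a UTF-8 character turns up in a place where there shouldn't be one. If no errors would occur when I simply apply .encode('utf-8') to every cell I read, I will probably end up doing that.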

score 4 · Accepted Answer

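The xlrd documentation states clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are probably reading files newer than Excel 97, they contain Unicode code points anyway. It is therefore necessary to keep the contents of these cells as Unicode within Python and not convert them to ASCII (which is what you do with the str() function). Use the code below: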

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
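
The same idea (keep values as Unicode text and encode exactly once, at the point of writing) can also be sketched without xlrd. This is only an illustration under stated assumptions: it targets Python 3, where str is already Unicode, and format_row, write_rows and the sample rows are hypothetical stand-ins for real cell values, not part of the original code.

```python
import io
import os
import tempfile


def format_row(first, second):
    # Hypothetical helper: build the output line as text, with no encoding yet.
    return u"important stuff -> {0} more formatting! {1} and done\n".format(first, second)


def write_rows(path, rows):
    # The file object encodes to UTF-8 on write, so no per-cell .encode() is needed.
    with io.open(path, "w", encoding="utf-8") as out:
        for first, second in rows:
            out.write(format_row(first, second))


# Stand-in values for what numeric and text cells might contain (assumed data).
rows = [(1.5, u"caf\u00e9"), (42.0, u"\u0400")]
path = os.path.join(tempfile.mkdtemp(), "output.file")
write_rows(path, rows)

# Read the file back as raw bytes to inspect the encoded result.
with io.open(path, "rb") as f:
    raw = f.read()
```

Because the encoding happens in one place, a non-ASCII character showing up in an unexpected cell does not need any special handling.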

score 0 · Answer

This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar increases the chance that people will read your code. Try wrapping your lines, for example:

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

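(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats: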

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
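
For readers on Python 3, where the unicode() builtin no longer exists, the conversion in (2) could be sketched as below. This is a hedged illustration, not part of xlrd: cell_to_text is a hypothetical name, and the sample values merely stand in for the floats and text that cell values may hold. It also shows why calling .encode('utf-8') on every cell is not safe in general: a float has no encode method.

```python
def cell_to_text(value):
    # Hypothetical helper: turn a cell value into text without losing float precision.
    if isinstance(value, float):
        # repr() of a float round-trips: float(repr(x)) == x.
        return repr(value)
    # Text is returned unchanged; other types go through str().
    return value if isinstance(value, str) else str(value)


# Stand-in cell values (assumed data): a number, text, and an int.
samples = [123456.78901234567, u"\u0400", 1234]
converted = [cell_to_text(v) for v in samples]

# A blanket .encode('utf-8') would fail on the numeric value.
float_has_encode = hasattr(samples[0], "encode")
```

Encoding then happens once, when the assembled line is written out, rather than per cell.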

(3) xlrd builds Cell objects on the fly when they are requested, so reading the value directly is faster:

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
