python - Python: I use .decode() - 'ascii' codec can't encode

Question

It looks like I was using the wrong function. With .fromstring there are no error messages.

xml_ = load() # here comes the unicode string with Cyrillic letters 

print xml_    # prints everything fine 

print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode 

xml = xml_.decode('utf-8') # here is an error

doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here

 File "testLog.py", line 48, in <module>
    xml = xml_.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)

If

xml = xml_.encode('utf-8')

doc = lxml.etree.parse(xml) # here's an error

or

xml = xml_

then

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)

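If I understand correctly: I must decode a non-ASCII string into the internal representation, work with that representation, and encode it back before sending it to output? It seems that is exactly what I am doing.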

The input data must be in UTF-8 because of the 'Accept-Charset': 'utf-8' header.

score 6 · Accepted Answer

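String and unicode objects have different types and different in-memory representations of their content. Unicode is the decoded form of text, while a string is the encoded form.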

# -*- coding: utf-8 -*-

# Now, my string literals in this source file will
#    be str objects encoded in utf-8.

# In Python3, they will be unicode objects.
#    Below examples show the Python2 way.

s = 'ş'
print type(s) # prints <type 'str'>

u = s.decode('utf-8')
# Here, we create a unicode object from a string
#    which was encoded in utf-8.

print type(u) # prints <type 'unicode'>

As you can see,

.encode() --> str
.decode() --> unicode

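When encoding or decoding strings, we need to make sure the text is actually covered by the source/target encoding. A string encoded in iso-8859-1 cannot be decoded correctly with iso-8859-9.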
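As a minimal sketch of that mismatch (written so that it also runs on Python 3; the variable names are only illustrative), the same byte decodes without any exception under both codecs but yields a different character:

```python
# -*- coding: utf-8 -*-
# Sketch: one byte, two codecs, two different characters.

text = u'\u015f'  # LATIN SMALL LETTER S WITH CEDILLA, the character used above

raw = text.encode('iso-8859-9')    # a single byte in the Turkish codec
wrong = raw.decode('iso-8859-1')   # no exception, but a different character
right = raw.decode('iso-8859-9')   # decoding with the matching codec round-trips

print(repr(raw))
print(wrong == text)
print(right == text)
```

Note that the wrong codec fails silently here, so a mismatch of this kind shows up as garbled text rather than as a traceback.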

As for the second error report in the question, lxml.etree.parse() works on file-like objects. To parse from a string, lxml.etree.fromstring() should be used.
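A rough sketch of the difference, assuming a small made-up document (the `log`/`msg` element names are hypothetical). The fallback to the standard library's ElementTree is only there so the sketch runs where lxml is not installed; both expose `fromstring()` and `parse()`:

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

# Unicode text with Cyrillic letters, encoded to UTF-8 bytes before parsing.
xml_text = u'<log><msg>\u041f\u0440\u0438\u0432\u0435\u0442</msg></log>'
xml_bytes = xml_text.encode('utf-8')

# fromstring() parses the document content directly.
root = etree.fromstring(xml_bytes)

# parse() expects a file name or file-like object, so wrap the bytes.
tree = etree.parse(io.BytesIO(xml_bytes))

print(root.find('msg').text == tree.getroot().find('msg').text)
```

Passing the document content itself to parse() makes it treat the content as a file name, which is where the error in the question comes from.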

score 2 · Accepted Answer

If your original string is unicode, it only makes sense to encode it to utf-8, not to decode it from utf-8.

I think the xml parser can only handle xml that is ascii.

So use xml = xml_.encode('ascii','xmlcharrefreplace') to convert the unicode characters that are not in ascii into xml character references.
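A small sketch of what that call produces, using a made-up `msg` element and the standard library's ElementTree (an assumption made for portability; the question uses lxml):

```python
import xml.etree.ElementTree as ET

text = u'\u041f\u0440\u0438\u0432\u0435\u0442'  # six Cyrillic letters

# Every non-ASCII character becomes a numeric reference such as &#1055;
encoded = text.encode('ascii', 'xmlcharrefreplace')

# The parser turns the references back into the original characters.
document = b'<msg>' + encoded + b'</msg>'
parsed = ET.fromstring(document)

print(encoded)
print(parsed.text == text)
```

This handler only replaces non-ASCII characters; it does not escape `&` or `<`. XML parsers generally accept UTF-8 bytes as well, so this is one option rather than a requirement.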

score 1 · Accepted Answer


For me, using the .fromstring() method was what was needed.

answered 2014-03-18T20:15:01.173

score 1 · Accepted Answer

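The lxml library already gives you the content as a unicode object. You are running into Python 2's automatic unicode/bytes conversion. The hint is that you asked it to decode but got an encode error. Because .decode() needs bytes, Python 2 first tries to convert your unicode string to bytes with the default encoding (ascii), and only then would decode the result back into unicode.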

Use the .encode method on the unicode object to convert it to bytes (the str type).
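The implicit step can be imitated explicitly. This is only an emulation written to run on Python 3, where str has no .decode(); `py2_style_decode` is a hypothetical helper, not a real API:

```python
def py2_style_decode(text, encoding):
    """Hypothetical helper imitating unicode.decode() in Python 2."""
    # Implicit step: unicode -> bytes using the default 'ascii' codec.
    raw = text.encode('ascii')
    # Only then does the requested decoding happen.
    return raw.decode(encoding)


text = u'\u041f\u0440\u0438\u0432\u0435\u0442'  # Cyrillic letters

try:
    py2_style_decode(text, 'utf-8')
    error_name = None
except UnicodeEncodeError as exc:
    # The "decode" call fails in the hidden encode step.
    error_name = type(exc).__name__

# What the text needs instead: an explicit encode to bytes.
utf8_bytes = text.encode('utf-8')

print(error_name)
print(len(text), len(utf8_bytes))
```

The hidden ascii encode is what raises, which is why a call to decode reports an encode error.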

Watching this will teach you a lot about how to solve this problem: http://nedbatchelder.com/text/unipain.html

score 1 · Accepted Answer

I assume you are trying to parse some website?

Have you verified that the website is correct? Maybe their encoding is wrong?

Many websites are broken and rely on web browsers having very robust parsers. You could try beautifulsoup, which is also very robust.

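There is a de facto web standard that the "charset" HTTP header (which may involve negotiation and is related to the Accept-Charset you mentioned) is overruled by any <meta http-equiv=... tag in the HTML file!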

So you may simply not have UTF-8 input!
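One quick way to check that suspicion is to attempt a strict decode of the raw bytes. This is a minimal sketch; `looks_like_utf8` is a hypothetical helper, and windows-1251 is assumed only as an example of another encoding a Cyrillic page might use:

```python
def looks_like_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


sample = u'\u041f\u0440\u0438\u0432\u0435\u0442'

good = sample.encode('utf-8')    # what the client asked for
bad = sample.encode('cp1251')    # assumed alternative encoding, for illustration

print(looks_like_utf8(good))
print(looks_like_utf8(bad))
```

A clean decode does not prove the data is UTF-8, but a failure shows that it is not.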
