python - Parsing an XML file raises UnicodeEncodeError (ElementTree) / ValueError (lxml)

Question

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

I get back an XML document, but I am unable to parse it.

With either lxml:

>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So, even though the XML file starts with

<?xml version="1.0" encoding="UTF-8"?>

my impression is that it contains characters that are not allowed. How can I parse this file using either lxml or ElementTree?

score 16 · Accepted Answer

You are using the decoded unicode value. Use r.raw, the raw response data, instead:

r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)

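This reads the data directly from the response; note the stream=True option passed to .get().

Setting the r.raw.decode_content = True flag ensures that the raw socket gives you the decompressed content, even if the response is gzip or deflate compressed.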
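To illustrate why the decompression step matters, here is a minimal offline sketch. It does not use requests at all: an in-memory io.BytesIO stands in for r.raw, and gzip.GzipFile plays the role that decode_content = True plays on the real stream. The sample document and all variable names are assumptions made for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical payload, compressed the way a server might send it.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><jobs><job>Biologist</job></jobs>'
compressed = gzip.compress(xml_bytes)

# Feeding the still-compressed bytes to the parser fails.
try:
    ET.parse(io.BytesIO(compressed))
    parsed_compressed = True
except ET.ParseError:
    parsed_compressed = False

# Decompressing on the fly yields a file-like object of plain XML bytes,
# which the parser can read directly.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as decoded_stream:
    tree = ET.parse(decoded_stream)

job_text = tree.getroot().find("job").text
```

The parser only ever sees a file-like object that yields bytes, which is the same shape of input that etree.parse(r.raw) receives.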

You do not have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the undecoded response body:

r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)

XML parsers always expect bytes as input, because the XML format itself dictates how the parser decodes those bytes to Unicode text.
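A small sketch of that point, using the standard library's xml.etree.ElementTree on Python 3 rather than lxml. The sample document is made up; it contains the same non-breaking space (U+00A0) that appears in the traceback above. The same document is serialized in two encodings, and the parser uses each declaration to recover identical text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document containing a non-breaking space (U+00A0).
template = (
    '<?xml version="1.0" encoding="{enc}"?>'
    '<job><title>Senior\u00a0Biologist</title></job>'
)

# The same document serialized as bytes in two different encodings.
utf8_bytes = template.format(enc="UTF-8").encode("utf-8")
latin1_bytes = template.format(enc="ISO-8859-1").encode("iso-8859-1")

# The byte sequences differ, but the parser reads each declaration
# and decodes both inputs to the same Unicode text.
title_utf8 = ET.fromstring(utf8_bytes).find("title").text
title_latin1 = ET.fromstring(latin1_bytes).find("title").text
```

Because the declaration travels with the bytes, decoding the body first (as r.text does) discards information the parser would otherwise use.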

score 5 · Answer

Correction!

See below how I got it all wrong. Basically, when we use the .text attribute, the result is an already-decoded unicode string. Passing it to lxml raises the following exception:

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Which basically means that @martijn-pieters was right: we must use the raw response bytes as returned by .content.

Incorrect answer (but might be interesting to someone)

For whoever is interested. I believe the reason this error occurs is probably an invalid guess taken by requests as explained in Response.text documentation:

Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

So, following this, one could also make sure requests' r.text decodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8'.

This approach adds another validation that the received response is indeed in the correct encoding prior to parsing it with lxml.

score 0 · Answer

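I realize this question has already been answered, but I ran into a similar problem on Python 3 with code that worked fine on Python 2. My solution was to call str_xml.encode() and pass the resulting bytes to etree.fromstring(), then carry on with parsing and extracting the tags and attributes.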
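A minimal sketch of that workaround on Python 3. The helper name parse_xml_text and the sample document are hypothetical, and the standard library's xml.etree.ElementTree is used here so the snippet runs without extra packages; the same bytes could be handed to lxml's etree.fromstring. Note that str.encode() returns a new bytes object instead of changing the string in place, and that the encoding used should match the one named in the XML declaration.

```python
import xml.etree.ElementTree as ET


def parse_xml_text(text, declared_encoding="utf-8"):
    """Hypothetical helper: turn decoded XML text back into bytes, then parse."""
    if isinstance(text, str):
        # encode() returns new bytes; the original string is left unchanged.
        data = text.encode(declared_encoding)
    else:
        data = text  # already bytes, pass through untouched
    return ET.fromstring(data)


# Made-up document with a declaration and a non-breaking space (U+00A0).
str_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<jobs><job id="1">Senior\u00a0Biologist</job></jobs>'
)

root = parse_xml_text(str_xml)
job = root.find("job")
job_id = job.get("id")
job_title = job.text
```

When the bytes are still available, as with r.content, skipping the decode and re-encode round trip avoids the risk of a mismatch with the declaration.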
