python - Python urllib2 decode chunked encoding

Question

I have the following code to open and read URLs:

html_data = urllib2.urlopen(req).read()

and I believe this is the most standard way to read data over HTTP. However, when the response uses chunked transfer-encoding, it starts with the following characters:

1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...

This is caused by the chunked encoding mentioned above, and it corrupts my XML data.

How can I get rid of all the metadata related to the chunked encoding?

score 1 · Accepted Answer

I ended up with custom XML stripping, as follows:

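    xml_start = html_data.find('<?xml')
    if xml_start > 0:
        # Unexpected prefix (e.g. chunk-size lines): log it, then drop it.
        log_user_action(req.get_host(), 'chunked data', html_data, {})
        html_data = html_data[xml_start:]
    # Look for the closing tag only after trimming, so the index stays valid.
    end_tag = '</mytag>'
    xml_end = html_data.rfind(end_tag)
    if xml_end != -1:
        # Keep the whole closing tag and drop anything after it.
        html_data = html_data[:xml_end + len(end_tag)]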

I couldn't find any simple solution.

score 0 · Accepted Answer

1eb0\r\n2625\r\n are the segment start/stop positions (in hexadecimal) within the reassembled payload.
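
If the raw chunk framing really does leak into the body, it can be decoded rather than trimmed. Below is a minimal sketch, assuming the data follows the standard HTTP/1.1 chunked layout (a hexadecimal size line, CRLF, that many bytes of data, CRLF, repeated until a zero-size chunk). The function name `decode_chunked` is hypothetical and not part of urllib2; httplib normally performs this decoding itself, so this only applies when the framing shows up in what `read()` returns.

```python
def decode_chunked(raw):
    """Decode a chunked-encoded body (bytes) and return the payload bytes.

    Hypothetical helper: assumes `raw` holds the complete chunked body.
    """
    payload = []
    pos = 0
    while True:
        # Each chunk begins with a hexadecimal size line ending in CRLF.
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # Drop optional chunk extensions that may follow a ';'.
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            # Zero-size chunk marks the end; any trailers are ignored here.
            break
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        payload.append(chunk)
        pos += size
        # Every chunk's data is followed by its own CRLF.
        if raw[pos:pos + 2] != b"\r\n":
            raise ValueError("missing CRLF after chunk data")
        pos += 2
    return b"".join(payload)
```

For example, `decode_chunked(b"5\r\nhello\r\n0\r\n\r\n")` yields `b"hello"`. Unlike slicing at `<?xml`, this also removes size lines that appear in the middle of the body.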

answered 2012-09-19T23:44:56.523

score -1 · Accepted Answer

You can remove everything before <?xml:

html_data = html_data[html_data.find('<?xml'):]
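
One caveat with the one-liner: when `'<?xml'` is absent, `find` returns -1 and the slice keeps only the last character. A slightly more defensive sketch follows; the helper name `strip_before_xml_decl` is made up for illustration. Note that it only removes a leading prefix, so any chunk-size lines embedded later in the body would remain.

```python
def strip_before_xml_decl(data, marker="<?xml"):
    """Return `data` starting at the XML declaration, or unchanged if none is found."""
    start = data.find(marker)
    if start == -1:
        # No declaration present: leave the data alone instead of slicing from -1.
        return data
    return data[start:]
```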

answered 2011-08-28T14:08:04.317
