2

I have the following code to open and read URLs:

html_data = urllib2.urlopen(req).read()

and I believe this is the most standard way to read data over HTTP. However, when the response uses chunked transfer encoding, the body starts with the following characters:

1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...

This happens because of the chunked encoding mentioned above, and as a result my XML data is corrupted.

How can I get rid of all the metadata related to the chunked encoding?


3 Answers

1

I ended up with custom XML stripping, as follows:

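    # Drop anything before the XML declaration (e.g. stray chunk-size lines).
    xml_start = html_data.find('<?xml')
    if xml_start > 0:
        log_user_action(req.get_host(), 'chunked data', html_data, {})
        html_data = html_data[xml_start:]
    # Drop anything after the closing root tag; search only after trimming the prefix.
    end_tag = '</mytag>'
    xml_end = html_data.rfind(end_tag)
    if xml_end != -1:
        html_data = html_data[:xml_end + len(end_tag)]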

I couldn't find any simple solution.

answered 2011-09-04T12:15:56.263
0

1eb0\r\n2625\r\n are the segment start/stop positions (in hexadecimal) within the reassembled payload.
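
In chunked transfer encoding, each chunk is preceded by its size in hexadecimal followed by CRLF, and a zero-size chunk ends the body. As a rough illustration only, here is a minimal sketch of a decoder for a body that still contains that framing. The name decode_chunked is hypothetical and not part of urllib2. The sketch assumes the whole body is already in memory as a byte string and is well-formed, and it ignores chunk extensions and trailer headers. The HTTP library normally removes this framing itself, so seeing it in the data may mean the server or a proxy applied the encoding twice.

```python
def decode_chunked(body):
    """Decode an HTTP/1.1 chunked body (byte string) into the raw payload.

    Hypothetical helper: assumes a complete, well-formed chunked body.
    """
    payload = []
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # The size line may carry extensions after ';', which are ignored here.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailer headers are ignored
        chunk = body[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        payload.append(chunk)
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return b"".join(payload)
```

If the data does not follow this framing exactly, int() raises ValueError on the size line, in which case falling back to trimming around the XML markers may be the more practical option.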

answered 2012-09-19T23:44:56.523
-1

You can remove everything before <?xml:

html_data = html_data[html_data.find('<?xml'):]

answered 2011-08-28T14:08:04.317