python - Missing "Content-Length" header when using Python's urllib2 urlopen

Question

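When I use urllib2 in Python to check the "Content-Length" header of certain web pages, the header is missing. For example, the response from google.com does not include it. Any idea why?

Example: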

import urllib2

r = urllib2.urlopen('http://www.google.com')
i = r.info()
print i.keys()

This gives:

['x-xss-protection', 'set-cookie', 'expires', 'server', 'connection', 'cache-control', 'date', 'p3p', 'content-type', 'x-frame-options']

score 1 · Accepted Answer

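An HTTP response can contain either Content-Length or Transfer-Encoding: chunked.

When Transfer-Encoding: chunked is used, the headers are followed by a hexadecimal string which, converted to decimal, gives the length of the next chunk. After the last chunk you get a chunk length of 0, which means you have reached the end of the file.

You can use a regular expression to extract this hexadecimal value (although it is not required):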

import re

read = '1a\r\n'  # placeholder: a line or a part of the raw HTTP response
hexPat = re.compile(r'([0-9A-F]+)\r\n', re.I)
match = re.search(hexPat, read)
chunkLen = int(match.group(1), 16)  # converts hexadecimal to decimal

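Alternatively, you can just read the first hexadecimal value, get the length of the first chunk and receive that chunk, then read the length of the next chunk, and so on until you find 0.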
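The read-a-length, read-a-chunk loop described above could look roughly like the sketch below. It assumes the whole raw body (everything after the headers) is already in memory as bytes. The name decode_chunked is made up for illustration; the sketch ignores trailers and discards any chunk extensions after a ';'.

```python
def decode_chunked(raw):
    """Decode a chunked HTTP body (bytes) and return the payload bytes.

    Hypothetical helper for illustration; assumes `raw` holds the complete
    chunked body and that no trailer headers need to be processed.
    """
    body = []
    pos = 0
    while True:
        # The chunk-size line ends at the next CRLF.
        line_end = raw.index(b'\r\n', pos)
        # Drop optional chunk extensions after ';' and parse the hex size.
        size_field = raw[pos:line_end].split(b';', 1)[0].strip()
        size = int(size_field.decode('ascii'), 16)
        pos = line_end + 2
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        body.append(raw[pos:pos + size])
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return b''.join(body)
```

Note that urllib2 (through httplib) already decodes chunked bodies for you, so len(r.read()) gives the payload size after the download; a manual decoder like this is only needed when reading from a raw socket.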

score 0 · Accepted Answer

The Content-Length of a HEAD response should, but does not always, contain the Content-Length value of the corresponding GET response:

Stack Overflow does:

> telnet stackoverflow.com 80
HEAD / HTTP/1.1
Host: stackoverflow.com


HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Length: 362245                           <--------
Content-Type: text/html; charset=utf-8
Expires: Mon, 04 Oct 2010 11:51:49 GMT
Last-Modified: Mon, 04 Oct 2010 11:50:49 GMT
Vary: *
Date: Mon, 04 Oct 2010 11:50:49 GMT

Google does not:

> telnet www.google.com 80
HEAD / HTTP/1.1
Host: www.google.ie


HTTP/1.1 200 OK
Date: Mon, 04 Oct 2010 11:55:36 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked
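
The same check can be done on the raw response text instead of by eye. This is a minimal sketch: content_length_from_head is a hypothetical helper, and the two sample strings are shortened copies of the telnet output above.

```python
def content_length_from_head(raw):
    """Return Content-Length from a raw response head as an int, or None.

    Hypothetical helper for illustration; `raw` is the status line plus
    headers, as printed by the telnet sessions above.
    """
    lines = raw.replace('\r\n', '\n').split('\n')
    for line in lines[1:]:  # lines[0] is the status line
        if not line.strip():
            break  # a blank line ends the headers
        name, _, value = line.partition(':')
        if name.strip().lower() == 'content-length':
            return int(value.strip())
    return None


# Shortened samples taken from the two transcripts above.
STACKOVERFLOW_HEAD = (
    'HTTP/1.1 200 OK\r\n'
    'Cache-Control: public, max-age=60\r\n'
    'Content-Length: 362245\r\n'
    'Content-Type: text/html; charset=utf-8\r\n'
    '\r\n'
)
GOOGLE_HEAD = (
    'HTTP/1.1 200 OK\r\n'
    'Server: gws\r\n'
    'Transfer-Encoding: chunked\r\n'
    '\r\n'
)
```

With these samples, the function returns 362245 for the Stack Overflow response and None for the Google one, so callers should be prepared for the header to be absent.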
