python - Downloading files with Python (REST URL)

Question

I am trying to write a script that will download a bunch of files from a website that uses REST URLs.

Here is the GET request:

GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close

If the request is good, it returns a 302 response such as:

HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

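What I need the script to do is check whether the response is a 302. If it is not, the script should "pass"; if it is, it needs to parse out the location header shown here: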

location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
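As a minimal sketch of that parsing step, assuming only the standard library's `urllib.parse` is available: the helper name `describe_location` is hypothetical, and the URL is the one from the 302 response above.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def describe_location(location):
    """Split a Location URL into host, file name, and query parameters."""
    parts = urlsplit(location)
    # The last path segment is a reasonable local file name.
    filename = posixpath.basename(parts.path)
    # parse_qs returns lists; keep the first value of each parameter.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, filename, query


location = (
    "http://downloads.test.stuff.com/5774/stuff/picture.jpg"
    "?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6"
)
host, filename, query = describe_location(location)
print(host, filename, query["hash"])
# prints: downloads.test.stuff.com picture.jpg 0f20f4a6d0c9f1720b0b6
```

The query string has to stay attached to the URL when the file is requested, since it appears to carry the expiry times and hash; it is split out here only for inspection.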

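Once I have the location value, I will have to make another GET request to download the file. I will also have to maintain the cookies for my session in order to download the file.

Can someone point me in the right direction on which library is best for this? I am having trouble figuring out how to parse the 302 response and how to add a cookie value like the one shown in my GET request above. I am sure there must be a library that can do all of this.

Any help would be much appreciated.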

score 0 · Accepted Answer

import urllib.request as ur
import urllib.error as ue

# Note: http.client.HTTPResponse.read([amt]) reads and returns the response
# body (or up to the next amt bytes) as bytes. The data stays as bytes because
# urlopen() has no way to determine the encoding of the stream it receives
# from the HTTP server, which is exactly what we want for binary files.

url = "http://www.example.org/images/{}.jpg"

dst = ""  # destination prefix; an empty string means the current directory
arr = ["01", "02", "03", "04", "05", "06", "07", "08", "09"]
# arr = range(10, 20)
try:
    for x in arr:
        print(str(x) + "). ".ljust(4), end="")
        # urlopen() returns an HTTPResponse; read() yields only the body
        with ur.urlopen(url.format(x)) as hrio, open(dst + str(x) + ".jpg", "wb") as fh:
            fh.write(hrio.read())
        print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
    # The first failed request ends the loop; remaining files are skipped
    print("\t[REQUEST INCOMPLETE]\t", end="")
    print("<Error ~ [{}]>".format(e))
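The script above fetches fixed image URLs and lets `urlopen()` follow redirects on its own, so it does not show the 302 check or the session cookie described in the question. Here is a minimal sketch of that flow using only `urllib`. The names `NoRedirect`, `make_opener`, `find_redirect`, and `download` are hypothetical, and the cookie string is assumed to be copied from a logged-in browser session rather than obtained by the script.

```python
import shutil
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Keep urllib from following redirects so a 302 can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises HTTPError.
        return None


def make_opener():
    """Build an opener that surfaces 3xx responses as HTTPError."""
    return urllib.request.build_opener(NoRedirect)


def find_redirect(opener, url, cookie_header):
    """Return the Location of a 302 response, or None for anything else."""
    req = urllib.request.Request(url, headers={"Cookie": cookie_header})
    try:
        with opener.open(req):
            return None  # a 2xx response is not a redirect, so "pass"
    except urllib.error.HTTPError as e:
        # Header lookup is case-insensitive, so "location:" also matches.
        location = e.headers.get("Location") if e.code == 302 else None
        e.close()
        return location


def download(opener, location, cookie_header, dst_path):
    """Fetch the redirect target with the same cookies and save the bytes."""
    req = urllib.request.Request(location, headers={"Cookie": cookie_header})
    with opener.open(req) as resp, open(dst_path, "wb") as fh:
        shutil.copyfileobj(resp, fh)


# Example use (placeholders: substitute a real URL and session cookie):
# opener = make_opener()
# location = find_redirect(opener, "http://testServer.com/test/download/id/5774/format/testTitle", "PHPSESSID=...")
# if location is not None:
#     download(opener, location, "PHPSESSID=...", "picture.jpg")
```

Sending the `Cookie` header by hand keeps the sketch short. If the server sets new cookies along the way, `http.cookiejar.CookieJar` together with `urllib.request.HTTPCookieProcessor` can track them instead; the third-party `requests` library offers `Session` objects and `allow_redirects=False` for the same purpose.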
