python - wget vs. urlretrieve in Python

Question

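I have a task to download several GB of data from a website. The data comes as .gz files, each 45 MB in size.

The easy way to get the files is to use "wget -r -np -A files url". This downloads the data recursively and mirrors the website. The download rate is very high: 4mb/sec.

But, just to play around, I was also using Python to build my own URL parser.

Downloading via Python's urlretrieve is very slow, possibly 4 times as slow as wget. The download rate is 500kb/sec. I use HTMLParser to parse the href tags.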
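The asker's parser is not shown. As a rough sketch of what an HTMLParser-based link collector of this kind could look like (Python 3 standard library; the class name `GzLinkParser`, the sample page, and the `http://some_server/data/` base URL are made up for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class GzLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that end in .gz."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(".gz"):
                # Resolve relative links against the URL of the listing page.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical directory listing, standing in for a page fetched from the server.
page = '<a href="../">Parent</a> <a href="a.gz">a.gz</a> <a href="sub/b.gz">b.gz</a>'
parser = GzLinkParser("http://some_server/data/")
parser.feed(page)
print(parser.links)
```

Parsing a listing like this takes negligible time compared with transferring 45 MB files, so the parser is unlikely to explain a difference in download rate.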

I am not sure why this is happening. Are there any settings for this?

Thanks

score 40 · Accepted Answer

Probably a unit math error on your part.

Just noticing that 500KB/s (kilobytes) is equal to 4Mb/s (megabits).
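The arithmetic behind that equivalence can be checked in a couple of lines. This is only a sketch assuming decimal units (1 KB = 1000 bytes, 1 Mb = 1,000,000 bits); the function name `kilobytes_to_megabits` is made up for illustration:

```python
def kilobytes_to_megabits(kb_per_s):
    """Convert kilobytes per second to megabits per second (decimal units)."""
    bits_per_s = kb_per_s * 1000 * 8   # 1 KB = 1000 bytes, 1 byte = 8 bits
    return bits_per_s / 1_000_000      # 1 Mb = 1,000,000 bits


print(kilobytes_to_megabits(500))  # 4.0
```

With binary units (1 KB = 1024 bytes) the result is 4.096 Mb/s, so the conclusion is the same either way.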

Answered 2009-06-10T18:14:50.453

score 9 · Accepted Answer

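urllib works as fast as wget for me. Try this code. It shows the progress as a percentage, just like wget.

import sys, urllib

def reporthook(a, b, c):
    # a: blocks transferred so far, b: block size in bytes, c: total size in bytes
    # The ',' at the end of the print line is important: it suppresses the newline (Python 2).
    print "% 3.1f%% of %d bytes\r" % (min(100, float(a * b) / c * 100), c),
    # You can also use sys.stdout.write:
    # sys.stdout.write("\r% 3.1f%% of %d bytes"
    #                  % (min(100, float(a * b) / c * 100), c))
    sys.stdout.flush()

# Download every URL given on the command line into the current directory.
for url in sys.argv[1:]:
    i = url.rfind('/')
    file = url[i+1:]
    print url, "->", file
    urllib.urlretrieve(url, file, reporthook)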

score 3 · Accepted Answer

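As for the HTML parsing, the fastest/easiest option you will probably get is lxml. As for the HTTP requests themselves: httplib2 is very easy to use, and could possibly speed up downloads because it supports HTTP 1.1 keep-alive connections and gzip compression. There is also pycURL, which claims to be very fast (but more difficult to use) and is built on libcurl, but I've never used it.

You could also try downloading different files concurrently, but keep in mind that trying to optimize your download times too far may not be very polite towards the website in question.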
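One way to download a few files at once while staying polite is to cap the number of simultaneous connections. The following is only a sketch using the Python 3 standard library; `download_all`, `filename_for`, and the default of two workers are illustrative choices, not something from this thread:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def filename_for(url):
    """Use the last path segment of the URL as the local file name."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def download_all(urls, fetch=urlretrieve, max_workers=2):
    """Fetch every URL with at most max_workers downloads in flight.

    fetch is any callable taking (url, filename); it defaults to urlretrieve.
    Returns a dict mapping each URL to its local file name.
    """
    targets = {url: filename_for(url) for url in urls}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, url, name) for url, name in targets.items()]
        for future in futures:
            future.result()  # re-raise any error from a failed download
    return targets
```

Calling `download_all(list_of_urls)` would save each file in the current directory. Keeping `max_workers` small limits the load placed on the server.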

Sorry for the lack of hyperlinks, but SO tells me "sorry, new users can only post a maximum of one hyperlink".

score 3 · Accepted Answer

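Transfer speeds can easily be misleading. Could you try the following script, which simply downloads the same URL with both wget and urllib.urlretrieve? Run it a few times, in case you are behind a proxy that caches the URL on the second attempt.

For small files, wget will take slightly longer because of the external process's startup time, but for larger files that should become irrelevant.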

from time import time
import urllib
import subprocess

target = "http://example.com" # change this to a more useful URL

wget_start = time()

proc = subprocess.Popen(["wget", target])
proc.communicate()

wget_end = time()


url_start = time()
urllib.urlretrieve(target)
url_end = time()

print "wget -> %s" % (wget_end - wget_start)
print "urllib.urlretrieve -> %s"  % (url_end - url_start)

score 1 · Accepted Answer

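Since Python suggests using urllib2 instead of urllib, I ran a test comparing urllib2.urlopen and wget.

The result: both take nearly the same time to download the same file. Sometimes urllib2 even performs better.

The advantage of wget lies in its dynamic progress bar, which shows the percentage completed and the current download speed while transferring.

The file size in my test was 5MB. I did not use any cache module in Python, and I am not aware of how wget works when downloading large files.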
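The test code itself was not posted. A comparable measurement could be sketched as below, written for Python 3, where urllib2.urlopen became urllib.request.urlopen; the function name `timed_download` and the 64 KB chunk size are arbitrary choices for illustration:

```python
import time
from urllib.request import urlopen


def timed_download(url, dest, chunk_size=64 * 1024):
    """Stream url into the file dest; return (bytes_written, seconds_elapsed)."""
    start = time.perf_counter()
    total = 0
    with urlopen(url) as response, open(dest, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the body is finished
                break
            out.write(chunk)
            total += len(chunk)
    return total, time.perf_counter() - start
```

Dividing the byte count by the elapsed seconds gives a rate in bytes per second, which can then be compared with the figure wget reports once both are expressed in the same unit.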

score 1 · Accepted Answer

Maybe you can wget and then inspect the data in Python?
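Once wget has fetched the .gz files, the standard gzip module can read them directly. This sketch assumes the archives contain text, which the question does not state, and the helper name `count_lines` is made up for illustration:

```python
import gzip


def count_lines(path):
    """Count the lines in a gzip-compressed text file without unpacking it to disk."""
    # "rt" opens the archive in text mode; undecodable bytes are replaced, not fatal.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)
```

For binary payloads, opening with mode "rb" and reading fixed-size chunks would be the analogous approach.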

Answered 2009-06-10T10:38:52.667

score 1 · Accepted Answer

import subprocess

myurl = 'http://some_server/data/'
subprocess.call(["wget", "-r", "-np", "-A", "files", myurl])

score 0 · Accepted Answer

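There shouldn't really be a difference. All urlretrieve does is make a simple HTTP GET request. Have you taken out your data processing code and done a straight throughput comparison of wget vs. pure Python?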

score 0 · Accepted Answer

Please show us some code. I'm pretty sure the problem is in your code and not in urlretrieve.

I've worked with it in the past and never had any speed-related issues.

score 0 · Accepted Answer

You can use wget -k, which converts the links in the downloaded pages so that they work locally.

Answered 2010-02-28T09:35:55.877
