python - Python,multi-threads,fetch webpages,download webpages

Question

I want to batch download web pages from one site. My 'urls.txt' file contains 5,000,000 URLs and is about 300 MB. How can I use multiple threads to fetch these URLs and download the pages? Or is there another way to batch download them?

My idea:

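import urllib.request

with open('urls.txt', 'r') as f:
    for line in f:
        # fetch each URL in turn (sequential, so far too slow for 5M pages)
        html = urllib.request.urlopen(line.strip()).read()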

Or should I use Twisted?

Is there a good solution for it?

score 3 · Accepted Answer

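If this isn't part of a larger program, then notnoop's idea of using an existing tool is a very good one. If a shell loop invoking wget solves your problem, that will be far easier than anything involving more custom software development.

However, if you need to fetch these resources as part of a larger program, then doing it from the shell may not be ideal. In that case, I strongly recommend Twisted, which makes it easy to handle many requests in parallel.

A few years ago I wrote up an example of how to do this. Take a look at http://jcalderone.livejournal.com/24285.html.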

score 1 · Accepted Answer

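Downloading 5M web pages all at once is definitely not a good idea, because you will max out a lot of things, including your network bandwidth and your OS's file descriptors. I would go in batches of 100-1000. You can use urllib.urlopen (urllib.request.urlopen in Python 3) to get a socket and then call read() on multiple threads. You may be able to use select.select. If so, go ahead and start all 1000 downloads, and distribute each file handle that select returns to about 10 worker threads. If select won't work, limit your batches to 100 downloads and use one thread per download. You certainly shouldn't start more than 100 threads, as your OS might blow up or at least run a bit slowly.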
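The batching idea above might be sketched as follows. This is a minimal sketch, not the answerer's code: it swaps select.select for a concurrent.futures thread pool, and the names fetch, batches, safe_fetch and download_in_batches, as well as the default batch size and thread cap, are assumptions chosen for illustration.

```python
import itertools
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def batches(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def safe_fetch(fetch_fn, url):
    """Return (url, body, error) so one failed URL does not abort the batch."""
    try:
        return url, fetch_fn(url), None
    except Exception as exc:
        return url, None, exc


def download_in_batches(urls, batch_size=100, max_threads=100, fetch_fn=fetch):
    """Process URLs one batch at a time with a capped number of threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for chunk in batches(urls, batch_size):
            # pool.map yields results in input order; the next batch is not
            # submitted until this one has been fully consumed.
            for result in pool.map(lambda u: safe_fetch(fetch_fn, u), chunk):
                yield result
```

Because download_in_batches is a generator and batches reads lazily, the 300 MB URL file never has to be held in memory: pass it an open file's stripped lines and write each body to disk as it arrives.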

score 1 · Accepted Answer

First parse your file and push the URLs into a queue, then spawn 5-10 worker threads to pull URLs off the queue and download them. Queues are your friend.
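A minimal sketch of that queue-and-workers pattern, using only the standard library. The function names (fetch, worker, download_all), the bounded queue size and the default of 8 workers are assumptions for illustration, and the sketch records only each page's size where a real run would save the body to disk.

```python
import queue
import threading
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def worker(url_queue, results, lock, fetch_fn):
    """Pull URLs off the queue until a None sentinel arrives."""
    while True:
        url = url_queue.get()
        if url is None:  # sentinel: no more work for this thread
            break
        try:
            body = fetch_fn(url)
            outcome = len(body)  # a real run would write body to disk here
        except Exception as exc:  # one bad URL must not kill the worker
            outcome = exc
        with lock:
            results[url] = outcome


def download_all(urls, num_workers=8, fetch_fn=fetch):
    """Feed urls to worker threads; return {url: byte count or exception}."""
    # Bounded queue, so the producer cannot run far ahead of the workers.
    url_queue = queue.Queue(maxsize=1000)
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(url_queue, results, lock, fetch_fn))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for url in urls:
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

To drive it from the question's file, pass a lazy generator over the stripped, non-empty lines of urls.txt, so the URLs are read only as fast as the workers consume them.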

score 0 · Accepted Answer

A wget script is probably the simplest option, but if you're looking for a Python/Twisted crawling solution, check out Scrapy.
