python - Slow downloads with a Python script and TOR (source code included)

Question

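I am trying to download HTML pages using my Python script and a TOR proxy server. It works, but it is very slow, and the code is poorly organized, so most of the time is spent renewing my IP rather than downloading pages. How can I speed up downloads through TOR? How can I organize the code more efficiently?

There are two scripts. Script1 downloads HTML pages from the website; once the website blocks me, Script2 has to be run to renew the IP with the help of the TOR proxy, and so on. The IP gets blocked after a few seconds. Should I reduce the number of threads? How? Please help me speed up the process. I am only getting 300-500 HTML pages per hour.

Here is the full code of Script1: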

# -*- coding: UTF-8 -*-
import os
import sys
import socks
import socket
import subprocess
import time
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, '127.0.0.1', 9050, True)
socket.socket = socks.socksocket
import urllib2
class WebPage:

    def __init__(self, path, country, url, lower=0,upper=9999):
        # Build the target directory portably (the earlier "/"-joined assignment was overwritten anyway)
        self.dir = os.path.join(str(path), str(country))
        self.url = url
        try:
            fin = open(self.dir+"/limit.txt",'r')
            limit = fin.readline()
            limits = str(limit).split(",")
            lower = int(limits[0])
            upper = int(limits[1])
            fin.close()
        except:
            fout = open(self.dir+"/limit.txt",'wb')
            limits = str(lower)+","+str(upper)
            fout.write(limits)
            fout.close()  
        self.process_instances(lower,upper)


    def process_instances(self,lower,upper):
            try:
                os.stat(self.dir)
            except:
                os.mkdir(self.dir)
            for count in range(lower,upper+1):
                if count == upper:
                    print "all downloaded, quitting the app!!"
                    break
                targetURL = self.url+"/"+str(count)
                print "Downloading :" + targetURL
                req = urllib2.Request(targetURL)
                try:
                    response = urllib2.urlopen(req)
                    the_page = response.read()  
                    if the_page.find("Your IP suspended")>=0:
                        print "The IP is suspended"
                        fout = open(self.dir+"/limit.txt",'wb')
                        limits = str(count)+","+str(upper)
                        fout.write(limits)
                        fout.close()  
                        break
                    if the_page.find("Too many requests")>=0:
                        print "Too many requests"
                        print "Renew IP...."
                        fout = open(self.dir+"/limit.txt",'wb')
                        limits = str(count)+","+str(upper)
                        fout.write(limits)
                        fout.close()
                        # Raw string so the backslashes in the Windows path are never treated as escapes
                        subprocess.Popen(r"C:\Users\John\Desktop\Data-Mine\yp\lol\lol2.py", shell=True)
                        time.sleep(2)
                        subprocess.call('lol1.py')
                    if the_page.find("404 error")>=0:
                        print "the page not exist"
                        continue
                    self.saveHTML(count, the_page)
                except:
                    print "The URL cannot be fetched"
                    execfile('lol1.py')
                    # Re-raise so the failure is not silently swallowed
                    raise
    def saveHTML(self,count, content):
        fout = open(self.dir+"/"+str(count)+".html",'wb')
        fout.write(content)
        fout.close()
if __name__ == '__main__':

    if len(sys.argv) !=6:
        print "cannot process!!! Five Parameters are required to run the process."
        print "Parameter 1 should be the path where to save the data, eg, /Users/john/data/"
        print "Parameter 2 should be the name of the country for which data is collected, eg, japan"
        print "Parameter 3 should be the URL from which the data to collect, eg, the website link"
        print "Parameter 4 should be the lower limit of the company id, eg, 11 "
        print "Parameter 5 should be the upper limit of the company id, eg, 1000 "
        print "The output will be saved as the HTML file for each company in the target folder's country"
        exit()

    else:
        path = str(sys.argv[1])
        country = str(sys.argv[2])
        url = str(sys.argv[3])
        lowerlimit = int(sys.argv[4])
        upperlimit = int(sys.argv[5])
        WebPage(path, country, url, lowerlimit,upperlimit)

score 0 · Accepted Answer

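TOR is very slow, so you should not expect to get that many pages per hour. There are, however, some ways to speed things up. Most notably, you can turn on GZIP compression for urllib (see this question, for example) to improve speed somewhat.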
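Requesting GZIP from the server and unpacking the reply could look roughly like the sketch below. It is written against Python 3's `urllib.request` (the successor of the `urllib2` module used in the question; `urllib2.Request` offers the same `add_header` method). The helper names `build_gzip_request`, `decode_body` and `fetch` are made up for illustration, and the SOCKS proxy setup from the question is left out.

```python
import gzip
import io
import urllib.request


def build_gzip_request(url):
    """Build a request that tells the server gzip-compressed bodies are acceptable."""
    req = urllib.request.Request(url)
    req.add_header("Accept-Encoding", "gzip")
    return req


def decode_body(raw, content_encoding):
    """Decompress the body only if the server reports that it used gzip."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw


def fetch(url):
    """Fetch a page and return its body, decompressed if necessary."""
    response = urllib.request.urlopen(build_gzip_request(url))
    return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

A server is free to ignore the `Accept-Encoding` header, which is why `decode_body` checks the `Content-Encoding` response header instead of assuming the body is compressed.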

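TOR as a protocol has fairly low bandwidth, because the data has to be relayed several times and every relay must spend its own bandwidth on your request. If the data is relayed, say, 6 times (a default TOR circuit to an ordinary website passes through three relays, and onion-service connections use more), the request consumes roughly 6 times the bandwidth in total. GZIP compression can shrink HTML to (in some cases) around 10% of its original size, so it will probably speed up the process.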
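To get a feel for how much GZIP can save, the following sketch compresses a synthetic, highly repetitive HTML page and reports the ratio. The sample page and the `gzip_ratio` helper are invented for illustration; real pages will compress differently, so treat the result as an indication only.

```python
import gzip


def gzip_ratio(data):
    """Return compressed size divided by original size for a bytes payload."""
    return len(gzip.compress(data)) / len(data)


# Hypothetical sample: a listing page with many near-identical table rows.
rows = "".join(
    '<tr><td class="name">Company %d</td><td class="id">%d</td></tr>\n' % (i, i)
    for i in range(500)
)
sample_html = ("<html><body><table>\n" + rows + "</table></body></html>").encode("utf-8")

ratio = gzip_ratio(sample_html)
print("original bytes:  ", len(sample_html))
print("compressed size: %.1f%% of original" % (ratio * 100))
```

Running the same measurement on a few pages already saved by the script would show whether the target site's HTML is repetitive enough for compression to matter.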
