python - How to get faster speed when using multi-threading in Python

Question

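Right now I am studying how to fetch data from a website as quickly as possible. To get faster speed, I am considering using multi-threading. Here is the code I used to test the difference between a multi-threaded POST and a simple POST.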

import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
        self.mode = mode

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()



        print "OK"

if __name__ == "__main__":

    current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
                        "Simple")

    #save the time before post data
    origin_time = time.time()

    if(current_post.mode == "Multiple"):

        #multithreading POST

        for i in range(0, 10):
           thread = threading.Thread(target = current_post.post)
           thread.start()
           thread.join()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

    if(current_post.mode == "Simple"):

        #simple POST

        for i in range(0, 10):
            current_post.post()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

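As you can see, this is very simple code. First I set the mode to "Simple" and got a time interval of 50 s (maybe my connection is a bit slow :(). Then I set the mode to "Multiple" and got a time interval of 35 s. From that I can see that multi-threading does increase the speed, but the result is not as good as I imagined. I want to get a much faster speed.

From debugging, I found that the program mainly blocks at the line open_url = urllib2.urlopen(req, self.data); this line takes a lot of time to post data to and receive data from the specified website. I thought maybe I could get a faster speed by adding time.sleep() and using multi-threading inside the urlopen function, but I cannot do that because it is Python's own function.

Leaving aside possible limits the server places on posting speed, what else can I do to get a faster speed? Or is there any other code I can modify? Thanks a lot!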

score 16 · Accepted Answer

The biggest thing you are doing wrong, the one that hurts your throughput the most, is the way you are calling thread.start() and thread.join():

for i in range(0, 10):
   thread = threading.Thread(target = current_post.post)
   thread.start()
   thread.join()

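Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next thread. You are not doing anything concurrently at all!

What you should probably do instead is: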

threads = []

# start all of the threads
for i in range(0, 10):
   thread = threading.Thread(target = current_post.post)
   thread.start()
   threads.append(thread)

# now wait for them all to finish
for thread in threads:
   thread.join()
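
The difference between the two loops can be measured without touching the network. The following is a minimal sketch, written for Python 3 rather than the Python 2 used in the question, in which time.sleep() stands in for the blocking urlopen() call. The names fake_post, run_join_inside_loop and run_join_after_loop are made up for this illustration and are not part of the original code.

```python
import threading
import time


def fake_post(delay):
    # Stand-in for urllib2.urlopen(): sleeping blocks the thread and
    # releases the GIL, much like waiting on a socket does.
    time.sleep(delay)


def run_join_inside_loop(n, delay):
    # Pattern from the question: each thread is joined right after it is
    # started, so the waits happen one after another.
    start = time.perf_counter()
    for _ in range(n):
        thread = threading.Thread(target=fake_post, args=(delay,))
        thread.start()
        thread.join()
    return time.perf_counter() - start


def run_join_after_loop(n, delay):
    # Corrected pattern: start every thread first, then wait for all of
    # them, so the waits overlap.
    start = time.perf_counter()
    threads = []
    for _ in range(n):
        thread = threading.Thread(target=fake_post, args=(delay,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("join inside loop: %.3f s" % run_join_inside_loop(10, 0.05))
    print("join after loop:  %.3f s" % run_join_after_loop(10, 0.05))
```

With the join inside the loop the total time is roughly n times the delay; with the joins moved to a second loop it should be close to a single delay, since all the waits overlap.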

score 12 · Accepted Answer

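In many cases, Python's threading does not improve execution speed very well... sometimes, it makes it worse. For more information, see David Beazley's PyCon 2010 presentation on the Global Interpreter Lock and the PyCon 2010 GIL slides. The presentation is very informative, and I highly recommend it to anyone considering threading...

Even though David Beazley's talk explains that network traffic improves the scheduling of Python's threading module, you should use the multiprocessing module. I have included it as an option in your code (see the bottom of my answer).

Running this on one of my older machines (Python 2.6.6):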

current_post.mode == "Process"  (multiprocessing)  --> 0.2609 seconds
current_post.mode == "Multiple" (threading)        --> 0.3947 seconds
current_post.mode == "Simple"   (serial execution) --> 1.650 seconds

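I agree with TokenMacGuy's comment, and the numbers above include moving the .join() to a separate loop. As you can see, Python's multiprocessing is significantly faster than threading.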

from multiprocessing import Process
import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either:
        #   "Simple"      (Simple POST)
        #   "Multiple"    (Multi-thread POST)
        #   "Process"     (Multiprocessing)
        self.mode = mode
        self.run_job()

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()

        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if(self.mode == "Multiple"):

            #multithreading POST
            threads = list()
            for i in range(0, 10):
               thread = threading.Thread(target = self.post)
               thread.start()
               threads.append(thread)
            for thread in threads:
               thread.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Process"):

            #multiprocessing POST
            processes = list()
            for i in range(0, 10):
               process = Process(target=self.post)
               process.start()
               processes.append(process)
            for process in processes:
               process.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Simple"):

            #simple POST
            for i in range(0, 10):
                self.post()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)
        return time_interval

if __name__ == "__main__":

    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php", 
            "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
            method
            )

score 2 · Accepted Answer

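Keep in mind that the only case where multi-threading can "increase speed" in Python is when you have operations like this one that are heavily I/O bound. Otherwise multi-threading does not increase "speed", since it cannot run on more than one CPU (no, not even if you have multiple cores; Python does not work that way). You should use multi-threading when you want two things to be done at the same time, not when you want two things to run in parallel (i.e. two processes running separately).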

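Now, what you are actually doing will not increase the speed of any single DNS lookup, but it will allow multiple requests to be fired off while waiting for the results of others. You should be careful about how many you run at once, though, or you will just make the response times even worse than they already are.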
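
One way to keep the number of simultaneous requests under control is a fixed-size thread pool. The following is a minimal sketch for Python 3, assuming concurrent.futures is acceptable in place of hand-managed threads. The names ConcurrencyProbe, fake_request and run_bounded are hypothetical, and time.sleep() again stands in for the real network call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Records the highest number of calls that were in flight at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def fake_request(self, i):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stand-in for the blocking network wait
        with self._lock:
            self.active -= 1
        return i


def run_bounded(n_requests, max_workers):
    # The pool never runs more than max_workers calls at the same time;
    # the remaining requests queue up until a worker is free.
    probe = ConcurrencyProbe()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(probe.fake_request, range(n_requests)))
    return results, probe.peak


if __name__ == "__main__":
    results, peak = run_bounded(20, 4)
    print("completed:", len(results), "peak concurrency:", peak)
```

The right value for max_workers depends on the server and the connection, so it is worth measuring a few settings rather than assuming that more threads are always better.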

Also, please stop using urllib2 and use Requests: http://docs.python-requests.org

score 0 · Accepted Answer

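DNS lookups take time. There is nothing you can do about it. Large latencies are one reason to use multiple threads in the first place: multiple lookups and site GET/POST requests can then happen in parallel.

Dump the sleep(); it is not helping.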