I'm writing a client that loads and parses a lot of pages at once and sends data from them to a server. If I run just one page processor at a time, things go reasonably well:
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.98s (1.60s load html, 0.24s parse, 0.00s on queue, 0.14s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.87s (1.59s load html, 0.25s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 2.79s (1.78s load html, 0.28s parse, 0.00s on queue, 0.72s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 2.18s (1.70s load html, 0.34s parse, 0.00s on queue, 0.15s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.91s (1.47s load html, 0.21s parse, 0.00s on queue, 0.23s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.84s (1.59s load html, 0.22s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.90s (1.67s load html, 0.21s parse, 0.00s on queue, 0.02s to process) **********
But with roughly 20 running at once (each in its own thread), the HTTP traffic becomes very slow:
********** Round-trip (with 2 sends/7 loads) for (+0/.0/-0) was total 23.37s (16.39s load html, 0.30s parse, 0.00s on queue, 6.67s to process) **********
********** Round-trip (with 2 sends/5 loads) for (+0/.0/-0) was total 20.99s (14.00s load html, 1.99s parse, 0.00s on queue, 5.00s to process) **********
********** Round-trip (with 4 sends/4 loads) for (+0/.0/-0) was total 17.89s (9.17s load html, 0.30s parse, 0.12s on queue, 8.31s to process) **********
********** Round-trip (with 3 sends/5 loads) for (+0/.0/-0) was total 26.22s (15.34s load html, 1.63s parse, 0.01s on queue, 9.24s to process) **********
The `load html` bit is the time it takes to read the HTML of the web page I'm processing (`resp = self.mech.open(url)` through `resp.read(); resp.close()`). The `to process` bit is the time taken for the round trip from this client to the server that processes it (`fp = urllib2.urlopen(...); fp.read(); fp.close()`). The `X sends/Y loads` bit is the number of simultaneous sends to the server and loads of the web pages I'm processing that were in flight when the request to the server was made.
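For concreteness, the two timed bits are gathered around calls roughly like these (a sketch only; `timed_load`, `timed_send`, and `SERVER_URL` are placeholder names, not the real code):

```python
import time
import urllib2

SERVER_URL = 'http://example.com/process'  # placeholder for the real endpoint

def timed_load(mech, url):
    # the "load html" bit: fetch the page body through the mechanize browser
    t0 = time.time()
    resp = mech.open(url)
    html = resp.read()
    resp.close()
    return html, time.time() - t0

def timed_send(payload):
    # the "to process" bit: round-trip the ~400-byte payload to the server
    t0 = time.time()
    fp = urllib2.urlopen(SERVER_URL, payload)
    fp.read()
    fp.close()
    return time.time() - t0
```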
The bit I care most about is `to process`. The actual processing on the server only takes about 0.2s. Only about 400 bytes are being sent, so it's not a matter of hogging bandwidth. The interesting thing is that if I run a standalone program that opens 5 threads and does that send repeatedly (while the full client is doing all of its simultaneous sending/loading/parsing), it runs very quickly:
1 took 0.04s
1 took 1.41s in total
0 took 0.03s
0 took 1.43s in total
4 took 0.33s
2 took 0.49s
2 took 0.08s
2 took 0.01s
2 took 1.74s in total
3 took 0.62s
4 took 0.40s
3 took 0.31s
4 took 0.33s
3 took 0.05s
3 took 2.18s in total
4 took 0.07s
4 took 2.22s in total
Each `to process` in this standalone program takes only 0.01s to 0.50s, far less than the 6-10 seconds in the full-blown version, and it isn't using any fewer sending threads (it uses 5, and the full-blown version is also capped at 5). That is, while the full-blown version is running, a separate program sending these same `(+0/.0/-0)` requests of 400 bytes each needs only about 0.31s per request. So it's not that the machine I'm running on is maxed out... it appears that the many simultaneous loads in the other threads are slowing down the sends in the other threads, sends that should be fast (and actually are fast, in another program running on the same machine).
The sends are done with `urllib2.urlopen`, while the reads are done with mechanize (which ultimately uses a forked `urllib2.urlopen`).
Is there a way to get the full-blown program to run as fast as this mini standalone version, at least when they are sending the same things? I'm thinking of writing another program that just takes what to send over a named pipe or something, so that the sends happen in a separate process, but that seems silly somehow. Any thoughts are welcome.
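In case it helps to see what I mean, here is a rough sketch of that idea, using a `multiprocessing.Queue` as a stand-in for the named pipe (`SERVER_URL` and the payload are placeholders):

```python
import multiprocessing
import urllib2

SERVER_URL = 'http://example.com/process'  # placeholder

def sender_process(q):
    # runs in its own process: pull payloads off the queue and send them, so
    # the sends don't share an interpreter with the loading/parsing threads
    while True:
        payload = q.get()
        if payload is None:  # sentinel to shut down
            break
        fp = urllib2.urlopen(SERVER_URL, payload)
        fp.read()
        fp.close()

if __name__ == '__main__':
    send_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=sender_process, args=(send_queue,))
    p.start()
    # the loader/parser threads would call send_queue.put(data) instead of
    # calling urllib2.urlopen themselves
    send_queue.put('example payload')
    send_queue.put(None)
    p.join()
```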
Any suggestions on how to load many pages simultaneously faster (so the times look more like 1-3s instead of 10-20s) would also be welcome.
Edit: An additional note: I rely on mechanize's cookie-handling features, so any answer would ideally also offer a way of dealing with that...
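For what it's worth, mechanize's cookie handling is built on a `cookielib`-compatible jar, so something along these lines should let the same cookies be shared with a plain `urllib2` opener or written to disk for another process to pick up (a sketch; the filename is just an example):

```python
import cookielib
import mechanize
import urllib2

# give the browser an explicit cookie jar instead of its default internal one
cj = cookielib.LWPCookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)

# the same jar can back a plain urllib2 opener in this process...
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# ...or be saved so a separate sender/loader process can reload it
cj.save('cookies.lwp', ignore_discard=True)

cj2 = cookielib.LWPCookieJar()
cj2.load('cookies.lwp', ignore_discard=True)
```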
Edit: I have the same setup running with a different configuration, where only one page is opened and about 10-20 items are added to the queue at a time. Those get processed like a knife through butter; for example, here's the tail end of adding a big batch:
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.17s (1.14s wait, 0.04s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.19s (1.16s wait, 0.03s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.26s (0.80s wait, 0.46s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.35s (0.77s wait, 0.58s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+2/.4/-0) was total 1.44s (0.24s wait, 1.20s to process) **********
(I added the `wait` time, i.e. how long the item sat on the queue before being sent.) Note that `to process` is just as fast as in the standalone program. The problem only manifests in the client that is constantly reading and parsing web pages. (Note that the parsing itself eats a lot of CPU.)
Edit: Some preliminary testing suggests that I should use a separate process for each web page load... will post an update once that's up and running.
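Roughly what I'm testing looks like this (a sketch only; `parse()` and the URLs stand in for the real parsing code and page list, and `SERVER_URL` is a placeholder):

```python
import multiprocessing
import urllib2
import mechanize

SERVER_URL = 'http://example.com/process'  # placeholder

def parse(html):
    # stand-in for the real CPU-heavy parsing step
    return html[:400]

def load_and_parse(url):
    # runs in a worker process: the page load and the CPU-heavy parse get
    # their own interpreter, so they can't stall the sends in the parent
    br = mechanize.Browser()
    resp = br.open(url)
    html = resp.read()
    resp.close()
    return parse(html)

if __name__ == '__main__':
    urls = ['http://example.com/a', 'http://example.com/b']  # placeholders
    pool = multiprocessing.Pool(processes=4)
    for payload in pool.imap_unordered(load_and_parse, urls):
        # the send stays in the parent process, exactly as before
        fp = urllib2.urlopen(SERVER_URL, payload)
        fp.read()
        fp.close()
    pool.close()
    pool.join()
```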