
Will running multiple processes that all make HTTP requests be notably faster than one?

I'm parsing about a million urls using lxml.html.parse.

At first, I ran a Python process that simply looped through the urls and called lxml.html.parse(myUrl) on each, and waited for the rest of the method to deal with the data before doing so again. This way, I was able to process on the order of 10000 urls/hour.

I imagined that if I ran a few identical processes (dealing with different sets of urls), I would speed up the rate at which I could fetch these urls. Surprisingly (to me at least), I measured about 10400 urls/hour this time, which isn't notably better, considering I'm sure both rates were fluctuating dramatically.

My question is: why isn't running three of these processes much faster than one?

I know for a fact that my requests aren't meaningfully affecting their target in any way, so I don't think it's them. Do I not have enough bandwidth to make these extra processes worthwhile? If not, how can I measure this? Am I totally misunderstanding how my MacBook is running these processes? (I'm assuming they run on different cores, or as concurrent threads, or something roughly equivalent to that.) Something else entirely?

(Apologies if I mangled any web terminology -- I'm new to this kind of stuff. Corrections are appreciated.)

Note: I imagine that running these processes on three different servers would probably be about 3x as fast. (Is that correct?) I'm not interested in that -- worst case, 10000/hour is sufficient for my purposes.

Edit: from speedtest.net (run twice; second result in parentheses):

With 3 running:
Ping: 29 ms (25 ms)
Download speed: 6.63 mbps (7.47 mbps)
Upload speed: 3.02 mbps (3.32 mbps)

With all paused:
Ping: 26 ms (28 ms)
Download speed: 9.32 mbps (8.82 mbps)
Upload speed: 5.15 mbps (6.56 mbps)

1 Answer


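You have roughly 7 Mbit/s of download bandwidth (1 MB/s if you round up), and you are getting 2.888 pages per second (10,400 pages per hour). I'd say you are maxing out your connection, especially if you are on ADSL or WiFi, where you are certainly taking a hit from all the TCP connection handshakes.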

You are downloading roughly 354 kB of data per page across your processes, which isn't bad considering that it is close to your bandwidth limit.
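As a rough check of that arithmetic, here is a minimal Python sketch. The function names are made up for illustration, and treating the rounded-up "1 MB/s" as 1024 × 1024 bytes per second is an assumption on my part, not something measured.

```python
# Back-of-the-envelope check of the figures quoted in this answer.
# All helper names are hypothetical; the inputs come from the question.

BITS_PER_BYTE = 8


def pages_per_second(pages_per_hour):
    """Convert an hourly page rate to pages per second."""
    return pages_per_hour / 3600.0


def mbit_to_mbyte(mbit_per_s):
    """Convert megabits per second to megabytes per second."""
    return mbit_per_s / BITS_PER_BYTE


def implied_page_size_kb(bandwidth_bytes_per_s, pages_per_s):
    """Average page size (kB) if the link were fully used at this rate."""
    return bandwidth_bytes_per_s / pages_per_s / 1024.0


rate = pages_per_second(10400)      # about 2.89 pages/s
link_mb_s = mbit_to_mbyte(7)        # 0.875 MB/s, rounded up to 1 MB/s in the text
# Assumption: "1 MB/s" is taken as 1024 * 1024 bytes per second.
size_kb = implied_page_size_kb(1024 * 1024, rate)

print("pages/s: %.3f" % rate)
print("link: %.3f MB/s" % link_mb_s)
print("implied page size: %.0f kB" % size_kb)
```

With the unrounded 0.875 MB/s, the same calculation gives a somewhat smaller implied page size, so the 354 kB figure is best read as an upper estimate.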

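Take into account the TCP headers and everything that happens when a connection is actually established (SYN, ACK, etc.), and you are running at a decent speed.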

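Note: this only considers your download rate, which is much higher than your upload speed. The upload speed matters too, because that is what actually carries your connection requests, headers and so on to the web server. And while I know that most 3G modems and ADSL lines claim to be "full duplex", they really aren't (especially ADSL). Whatever your ISP tells you, you will never get full speed in both directions at once. If you want to do this kind of task, you need to switch to fiber.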

P.S. I assume you understand the basic difference between a megabit and a megabyte.

answered 2013-08-15T08:25:36.677