Will running multiple processes that all make HTTP requests be notably faster than one?
I'm parsing about a million URLs using lxml.html.parse.
At first, I ran a single Python process that simply looped through the URLs, called lxml.html.parse(myUrl) on each, and waited for the rest of the method to finish with the data before moving on. This way, I was able to process on the order of 10,000 URLs/hour.
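To make the setup concrete, here's roughly what the single-process loop looks like. (`parse_and_handle` is a hypothetical stand-in for lxml.html.parse plus my downstream handling; the sleep just simulates network + parse latency so the sketch runs without network access.)

```python
import time

def parse_and_handle(url):
    # hypothetical stand-in for lxml.html.parse(url) plus my data handling;
    # the sleep simulates network + parse latency so this runs offline
    time.sleep(0.001)
    return len(url)

urls = ["http://example.com/page/%d" % i for i in range(50)]

# one URL at a time, in order: the next fetch doesn't start until
# the previous one has been fully processed
results = [parse_and_handle(u) for u in urls]
print(len(results))  # prints 50
```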
I imagined that if I ran a few identical processes (each dealing with a different set of URLs), I would speed up the overall fetch rate. Surprisingly (to me, at least), with three processes I measured about 10,400 URLs/hour, which isn't notably better, especially since both rates fluctuated dramatically.
My question is: why isn't running three of these processes much faster than one?
I know for a fact that my requests aren't meaningfully affecting the target servers, so I don't think the bottleneck is on their end. Do I not have enough bandwidth to make the extra processes worthwhile? If not, how can I measure that? Am I totally misunderstanding how my MacBook runs these processes? (I'm assuming they run concurrently on different cores, or something roughly equivalent to that.) Something else entirely?
(Apologies if I mangled any web terminology -- I'm new to this kind of stuff. Corrections are appreciated.)
Note: I imagine that running these processes on three different machines would probably be about 3x as fast. (Is that correct?) I'm not interested in that, though; worst case, 10,000 URLs/hour is sufficient for my purposes.
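In case it helps, here's a sketch of the multi-process setup, using a multiprocessing.Pool instead of three separate scripts (which I believe is equivalent for this question). As above, `fetch_and_parse` and `run` are hypothetical stand-ins; the sleep simulates network latency so the example runs offline.

```python
from multiprocessing import Pool
import time

def fetch_and_parse(url):
    # hypothetical stand-in for lxml.html.parse(url) plus my handling;
    # the sleep simulates network + parse latency so this runs offline
    time.sleep(0.01)
    return len(url)

def run(urls, workers=3):
    # three worker processes, like the three identical scripts I'm running;
    # each worker pulls URLs from the shared list
    with Pool(processes=workers) as pool:
        return pool.map(fetch_and_parse, urls)

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(30)]
    print(len(run(urls)))  # prints 30
```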
Edit: from speedtest.net (twice):
With 3 running:
Ping: 29 ms (25 ms)
Download speed: 6.63 Mbps (7.47 Mbps)
Upload speed: 3.02 Mbps (3.32 Mbps)
With all paused:
Ping: 26 ms (28 ms)
Download speed: 9.32 Mbps (8.82 Mbps)
Upload speed: 5.15 Mbps (6.56 Mbps)