
I have some questions about the performance of this simple Python script:

import sys, urllib2, asyncore, socket, urlparse
from timeit import timeit

class HTTPClient(asyncore.dispatcher):
    def __init__(self, host, path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect( (host, 80) )
        self.buffer = 'GET %s HTTP/1.0\r\n\r\n' % path
        self.data = ''
    def handle_connect(self):
        pass
    def handle_close(self):
        self.close()
    def handle_read(self):
        self.data += self.recv(8192)
    def writable(self):
        return (len(self.buffer) > 0)
    def handle_write(self):
        sent = self.send(self.buffer)
        self.buffer = self.buffer[sent:]

url = 'http://pacnet.karbownicki.com/api/categories/'

components = urlparse.urlparse(url)
host = components.hostname or ''
path = components.path

def fn1():
    try:
        response = urllib2.urlopen(url)
        try:
            return response.read()
        finally:
            response.close()
    except:
        pass

def fn2():
    client = HTTPClient(host, path)
    asyncore.loop()
    return client.data

if sys.argv[1:]:
    print 'fn1:', len(fn1())
    print 'fn2:', len(fn2())

time = timeit('fn1()', 'from __main__ import fn1', number=1)
print 'fn1: %.8f sec/pass' % (time)

time = timeit('fn2()', 'from __main__ import fn2', number=1)
print 'fn2: %.8f sec/pass' % (time)

Here's the output I'm getting on Linux:

$ python2 test_dl.py
fn1: 5.36162281 sec/pass
fn2: 0.27681994 sec/pass

$ python2 test_dl.py count
fn1: 11781
fn2: 11965
fn1: 0.30849886 sec/pass
fn2: 0.30597305 sec/pass

Why is urllib2 so much slower than asyncore in the first run?

And why does the discrepancy seem to disappear on the second run?

EDIT: Found a hackish solution to this problem in the question "Force python mechanize/urllib2 to only use A requests?"

The five-second delay disappears if I monkey-patch the socket module as follows:

_getaddrinfo = socket.getaddrinfo

def getaddrinfo(host, port, family=0, socktype=0, proto=0, flags=0):
    return _getaddrinfo(host, port, socket.AF_INET, socktype, proto, flags)

socket.getaddrinfo = getaddrinfo
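The patch above stays in effect for the rest of the process. A minimal sketch of a scoped variant, assuming you only want IPv4-only lookups around specific calls: the name `ipv4_only` is hypothetical, and the wrapper uses only `socket.getaddrinfo` and `contextlib.contextmanager` from the standard library.

```python
import contextlib
import socket


@contextlib.contextmanager
def ipv4_only():
    """Temporarily force socket.getaddrinfo to request AF_INET (A records) only.

    Hypothetical helper for illustration; it restores whatever
    socket.getaddrinfo was in place when the block was entered.
    """
    original = socket.getaddrinfo

    def getaddrinfo_v4(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore the requested family and always ask for IPv4.
        return original(host, port, socket.AF_INET, type, proto, flags)

    socket.getaddrinfo = getaddrinfo_v4
    try:
        yield
    finally:
        socket.getaddrinfo = original
```

Used as `with ipv4_only(): fn1()`, this would limit the workaround to the urllib2 call instead of changing name resolution for the whole program. It is still a workaround and does not fix the underlying resolver behaviour.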

3 Answers


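Finally found a good explanation of what causes this problem, and why:

This is a problem with the DNS resolver.

The problem occurs for any DNS request that the DNS resolver does not support. The proper solution is to fix the DNS resolver.

What happens:

  • The program is IPv6-enabled.
  • When it looks up a hostname, getaddrinfo() asks for an AAAA record first.
  • The DNS resolver sees the request for the AAAA record and goes "hmm, I don't know what this is, let's throw it away".
  • The DNS client (getaddrinfo() in libc) waits for a response... and has to time out because none arrives. (This is the delay.)
  • No records have been received yet, so getaddrinfo() falls back to an A record request. This works.
  • The program gets the A records and uses them.

This does not only affect IPv6 (AAAA) records; it also affects any other DNS record type that the resolver does not support.

For me, the solution was to install dnsmasq (but I suppose any other DNS resolver would do).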
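One way to check whether a machine is affected is to time the lookup separately for each address family. This is a rough sketch, not a definitive diagnostic: the helper names `time_lookup` and `compare_families` are made up for illustration, and the `resolver` parameter exists only so the functions can be exercised without touching the network.

```python
import socket
import time


def time_lookup(host, family, resolver=socket.getaddrinfo):
    """Return (seconds, result_count) for one lookup restricted to `family`."""
    start = time.time()
    try:
        results = resolver(host, 80, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # A failed lookup still tells us how long the resolver took to give up.
        results = []
    return time.time() - start, len(results)


def compare_families(host, resolver=socket.getaddrinfo):
    """Time the same lookup for AF_UNSPEC, AF_INET and AF_INET6."""
    families = [
        ("AF_UNSPEC", socket.AF_UNSPEC),  # what urllib2 effectively asks for
        ("AF_INET", socket.AF_INET),      # A records only
        ("AF_INET6", socket.AF_INET6),    # AAAA records only
    ]
    timings = {}
    for name, family in families:
        timings[name] = time_lookup(host, family, resolver)
    return timings
```

Calling `compare_families('pacnet.karbownicki.com')` in a fresh process and seeing AF_UNSPEC and AF_INET6 take several seconds while AF_INET returns quickly would be consistent with the dropped AAAA query described above. Later calls may be skewed by caching.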

answered 2012-04-05T18:44:29.493

This is probably down to your OS: if it caches DNS lookups, the first request has to be answered by a DNS server, while subsequent requests for the same name are served from the cache.

EDIT: as the comments show, it's probably not a DNS caching problem. I still maintain that it's the OS and not Python. I've tested the code on both Windows and FreeBSD and didn't see this kind of difference; both functions need about the same time.

That is exactly how it should be: there shouldn't be a significant difference for a single request. I/O and network latency probably account for about 90% of these timings.

answered 2011-10-07T19:26:35.567

Have you tried it the other way around, that is, asyncore first and then urllib?

Case 1: We first try urllib and then asyncore.

fn1: 1.48460957 sec/pass
fn2: 0.91280798 sec/pass

Observation: asyncore completed the same operation 0.57180159 seconds faster.

Let's reverse it.

Case 2: We now try asyncore and then urllib.

fn2: 1.27898671 sec/pass
fn1: 0.95816954 sec/pass

Observation: this time urllib took 0.32081717 seconds less than asyncore.

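Two conclusions here:

  1. urllib2 always takes more time than asyncore. This is because urllib2 leaves the socket family unspecified, whereas asyncore lets the user define it, and in this case we defined it as AF_INET (IPv4).

  2. If two sockets are opened to the same server, whether by asyncore or urllib, the second one performs better. This is because of default caching behavior. To learn more, see: https://stackoverflow.com/a/6928657/1060337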
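To separate such warm-up effects from differences between the two implementations, each function can be timed several times in a row rather than once. A minimal sketch, assuming a hypothetical helper named `time_repeated`; it uses wall-clock time from the standard library and nothing specific to urllib2 or asyncore.

```python
import time


def time_repeated(fn, repeats=3):
    """Call fn() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.time()
        fn()
        durations.append(time.time() - start)
    return durations
```

Running `time_repeated(fn1)` and `time_repeated(fn2)` in separate, fresh processes would show whether only the first call is slow. If it is, the cost likely comes from something that happens once per name, such as the lookup, rather than from the HTTP client itself.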

References:

Want a general overview of how sockets work?

http://www.cs.odu.edu/~mweigle/courses/cs455-f06/lectures/2-1-ClientServer.pdf

Want to write your own sockets in Python?

http://www.ibm.com/developerworks/linux/tutorials/l-pysocks/index.html

To learn about socket families or the general terminology, see this wiki:

http://en.wikipedia.org/wiki/Berkeley_sockets

Note: this answer was last updated on April 5, 2012, at 2 AM IST.

answered 2012-04-04T02:27:18.613