python - python performance of httplib (discogs API)

Question

I wrote a short program that uses the Discogs API from Python, but it is so slow that it is not usable for real web applications. Here is the Python code and the profiler results (only the time-consuming spots are shown):

# -*- coding: utf-8 -*-

import profile
import discogs_client as discogs

def main():
    discogs.user_agent = 'Mozilla/5.0'
    #dump released albums into the file. You could also print it to the console
    f=open('DiscogsTestResult.txt', 'w+')

    #Use another band if you like, 
    #but if you decide to take "beatles" you will wait an hour! (cause of the num of releases)
    artist = discogs.Artist('Faust')
    print >> f, artist
    print  >> f," "

    artistReleases = artist.releases
    for r in artistReleases:
        print >> f, r.data
        print >> f,"---------------------------------------------"


print 'Performance Analysis of Discogs API'
print '=' * 80
profile.run('print main(); print')

And here is the output of Python's profile module:

Performance Analysis of Discogs API
================================================================================
   82807 function calls (282219 primitive calls) in 177.544 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      188  121.013    0.644  121.013    0.644 :0(connect)
      206   52.080    0.253   52.080    0.253 :0(recv)
        1    0.036    0.036  177.494  177.494 <string>:1(<module>)
      188    0.013    0.000  175.234    0.932 adapters.py:261(send)
      376    0.005    0.000    0.083    0.000 adapters.py:94(init_poolmanager)
      188    0.008    0.000  176.569    0.939 api.py:17(request)
      188    0.007    0.000  176.577    0.939 api.py:47(get)
      188    0.015    0.000  173.922    0.925 connectionpool.py:268(_make_request)
      188    0.015    0.000  174.034    0.926 connectionpool.py:332(urlopen)
        1    0.496    0.496  177.457  177.457 discogsTestFullDump.py:6(main)
      564    0.009    0.000  176.613    0.313 discogs_client.py:66(_response)
      188    0.012    0.000  176.955    0.941 discogs_client.py:83(data)
      188    0.011    0.000   51.759    0.275 httplib.py:363(_read_status)
      188    0.017    0.000   52.520    0.279 httplib.py:400(begin)
      188    0.003    0.000  121.198    0.645 httplib.py:754(connect)
      188    0.007    0.000  121.270    0.645 httplib.py:772(send)
      188    0.005    0.000  121.276    0.645 httplib.py:799(_send_output)
      188    0.003    0.000  121.279    0.645 httplib.py:941(endheaders)
      188    0.003    0.000  121.348    0.645 httplib.py:956(request)
      188    0.016    0.000  121.345    0.645 httplib.py:977(_send_request)
      188    0.009    0.000   52.541    0.279 httplib.py:994(getresponse)
        1    0.000    0.000  177.544  177.544 profile:0(print main(); print)
      188    0.032    0.000  176.322    0.938 sessions.py:225(request)
      188    0.030    0.000  175.513    0.934 sessions.py:408(send)
      752    0.015    0.000  121.088    0.161 socket.py:223(meth)
     2256    0.224    0.000   52.127    0.023 socket.py:406(readline)
      188    0.009    0.000  121.195    0.645 socket.py:537(create_connection)

Does anybody have an idea how to speed this up? I hope that some changes in discogs_client.py would make it faster, maybe switching from httplib to something else. Or maybe it would be faster to use another protocol instead of HTTP?

(The source of discogs_client.py can be accessed here :"https://github.com/discogs/discogs_client/blob/master/discogs_client.py")

If anybody has any idea please respond, a lot of people would benefit from this.

Regards Daniel

score 2 · Accepted Answer

Update: from the Discogs documentation: "Requests are throttled by the server to one per second per IP address. Your application should (but doesn't have to) take this into account and throttle requests locally, too."
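Local throttling of the kind the documentation mentions could be sketched roughly as follows. The Throttle class and its parameter names are hypothetical, not part of discogs_client; the one-second interval is taken from the quoted documentation.

```python
import time


class Throttle(object):
    """Allow at most one call per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous call, None before the first one

    def wait(self):
        """Block until at least `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

One possible use is to create a single Throttle shared by the whole process and call its wait() method right before every HTTP request. This does not make the run faster; it only keeps the client within the limit the server enforces anyway.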

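The bottleneck appears to be on the (Discogs) server side, in retrieving the individual releases. Short of giving them money to buy faster servers, there is really nothing you can do about that.

My suggestion is to cache the results, which is probably the only thing that will help. Rewrite discogs.APIBase._response as follows: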

def _response(self):
    # First try the on-disk cache (a helper you have to write yourself).
    if not self._cached_response:
        self._cached_response = self._load_response_from_disk()
    # Cache miss: fetch from the server and persist the result.
    if not self._cached_response:
        if not self._check_user_agent():
            raise UserAgentError("Invalid or no User-Agent set.")
        self._cached_response = requests.get(self._uri, params=self._params, headers=self._headers)
        self._save_response_to_disk()

    return self._cached_response

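Another approach is to write the request to a log and answer "we don't know yet, try again later". Then, in a separate process, read the log, download the data, and store it in a database. When the user comes back later, the requested data will be there, ready.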
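A minimal sketch of that log-and-fetch-later idea, assuming SQLite serves as both the request log and the result store. The table names, function names, and the `fetch` callable are all hypothetical; in practice `fetch` would wrap the actual HTTP request.

```python
import sqlite3


def open_store(path="discogs_jobs.sqlite"):
    """Open (or create) the store; the file name is an arbitrary example."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS pending (uri TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE IF NOT EXISTS results (uri TEXT PRIMARY KEY, body TEXT)")
    return db


def lookup_or_enqueue(db, uri):
    """Return the stored body, or log the request and return None ("try again later")."""
    row = db.execute("SELECT body FROM results WHERE uri = ?", (uri,)).fetchone()
    if row is not None:
        return row[0]
    with db:  # commits on success
        db.execute("INSERT OR IGNORE INTO pending (uri) VALUES (?)", (uri,))
    return None


def process_pending(db, fetch):
    """Worker step: download every logged URI with `fetch` and store the result."""
    uris = [r[0] for r in db.execute("SELECT uri FROM pending").fetchall()]
    for uri in uris:
        body = fetch(uri)
        with db:
            db.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (uri, body))
            db.execute("DELETE FROM pending WHERE uri = ?", (uri,))
    return len(uris)
```

The web application would call lookup_or_enqueue(), while a separate worker process would call process_pending() in a loop, respecting the server's rate limit between requests.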

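You need to write _load_response_from_disk() and _save_response_to_disk() yourself. The stored data should use _uri, _params, and _headers as the key, and should include a timestamp. If the data is too old (in this case I would suggest an expiry on the order of months; I don't know whether the numbering is persistent, so I would try days to weeks at first) or cannot be found, return None. The storage has to handle concurrent access and fast indexing, so it should probably be a database.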
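One way such storage might look, as a sketch under stated assumptions: SQLite as the database, a two-week expiry, and only the response body text being stored (so the _response method above would need adapting to rebuild whatever object it expects). ResponseCache and its method names are hypothetical, not part of discogs_client.

```python
import hashlib
import json
import sqlite3
import time

MAX_AGE_SECONDS = 14 * 24 * 3600  # assumed expiry of two weeks; tune as needed


class ResponseCache(object):
    """Timestamped response bodies keyed by (uri, params, headers)."""

    def __init__(self, path="discogs_cache.sqlite", max_age=MAX_AGE_SECONDS):
        self.max_age = max_age
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
        )

    @staticmethod
    def make_key(uri, params, headers):
        # Serialize with sorted dict keys so equal requests map to the same key.
        raw = json.dumps([uri, params, headers], sort_keys=True)
        return hashlib.sha1(raw.encode("utf-8")).hexdigest()

    def load(self, uri, params, headers, now=None):
        """Return the cached body, or None if it is missing or too old."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT body, fetched_at FROM responses WHERE key = ?",
            (self.make_key(uri, params, headers),),
        ).fetchone()
        if row is None or now - row[1] > self.max_age:
            return None
        return row[0]

    def save(self, uri, params, headers, body, now=None):
        """Store the body together with the time it was fetched."""
        now = time.time() if now is None else now
        with self.db:  # commits on success
            self.db.execute(
                "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
                (self.make_key(uri, params, headers), body, now),
            )
```

The primary key gives indexed lookups, and SQLite's file locking provides basic protection when several processes share the file. A server database may be a better fit under heavy concurrent load.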

score 0 · Accepted Answer

Try this: instead of using print >> to write to the file, use f.write('hello\n').

Answered 2013-07-03T20:56:14.773
