I wrote a short program that uses the Discogs API from Python, but it is so slow that it is not usable for real web applications. Here is the Python code and the profiler output (only the time-consuming entries are shown):

# -*- coding: utf-8 -*-

import profile
import discogs_client as discogs

def main():
    discogs.user_agent = 'Mozilla/5.0'
    # Dump the released albums into a file. You could also print them to the console.
    # The with-block closes the file even if a request fails.
    with open('DiscogsTestResult.txt', 'w+') as f:
        # Use another band if you like, but if you pick "beatles" you will
        # wait an hour (because of the number of releases).
        artist = discogs.Artist('Faust')
        print >> f, artist
        print >> f, " "

        artistReleases = artist.releases
        for r in artistReleases:
            print >> f, r.data
            print >> f, "---------------------------------------------"


print 'Performance Analysis of Discogs API'
print '=' * 80
profile.run('print main(); print')

And here is the output of Python's profiler:

Performance Analysis of Discogs API
================================================================================
   82807 function calls (282219 primitive calls) in 177.544 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      188  121.013    0.644  121.013    0.644 :0(connect)
      206   52.080    0.253   52.080    0.253 :0(recv)
        1    0.036    0.036  177.494  177.494 <string>:1(<module>)
      188    0.013    0.000  175.234    0.932 adapters.py:261(send)
      376    0.005    0.000    0.083    0.000 adapters.py:94(init_poolmanager)
      188    0.008    0.000  176.569    0.939 api.py:17(request)
      188    0.007    0.000  176.577    0.939 api.py:47(get)
      188    0.015    0.000  173.922    0.925 connectionpool.py:268(_make_request)
      188    0.015    0.000  174.034    0.926 connectionpool.py:332(urlopen)
        1    0.496    0.496  177.457  177.457 discogsTestFullDump.py:6(main)
      564    0.009    0.000  176.613    0.313 discogs_client.py:66(_response)
      188    0.012    0.000  176.955    0.941 discogs_client.py:83(data)
      188    0.011    0.000   51.759    0.275 httplib.py:363(_read_status)
      188    0.017    0.000   52.520    0.279 httplib.py:400(begin)
      188    0.003    0.000  121.198    0.645 httplib.py:754(connect)
      188    0.007    0.000  121.270    0.645 httplib.py:772(send)
      188    0.005    0.000  121.276    0.645 httplib.py:799(_send_output)
      188    0.003    0.000  121.279    0.645 httplib.py:941(endheaders)
      188    0.003    0.000  121.348    0.645 httplib.py:956(request)
      188    0.016    0.000  121.345    0.645 httplib.py:977(_send_request)
      188    0.009    0.000   52.541    0.279 httplib.py:994(getresponse)
        1    0.000    0.000  177.544  177.544 profile:0(print main(); print)
      188    0.032    0.000  176.322    0.938 sessions.py:225(request)
      188    0.030    0.000  175.513    0.934 sessions.py:408(send)
      752    0.015    0.000  121.088    0.161 socket.py:223(meth)
     2256    0.224    0.000   52.127    0.023 socket.py:406(readline)
      188    0.009    0.000  121.195    0.645 socket.py:537(create_connection)

Does anybody have an idea how to speed this up? I hope that some changes to discogs_client.py would make it faster, perhaps switching from httplib to something else. Or maybe it would be faster to use a protocol other than HTTP?

(The source of discogs_client.py is available here: https://github.com/discogs/discogs_client/blob/master/discogs_client.py)

If anybody has an idea, please respond. A lot of people would benefit from this.

Regards, Daniel

2 Answers

Update: from the Discogs documentation: "Requests are throttled by the server to one per second per IP address. Your application should (but doesn't have to) take this into account and throttle requests locally, too."
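Local throttling could look roughly like the sketch below. It assumes the one-request-per-second limit quoted above and is written for Python 3 (`time.monotonic` is not available in Python 2). `Throttle` is a hypothetical helper, not part of discogs_client; the clock and sleep functions are parameters only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle(object):
    """Space out calls so that at most one starts per `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time at which the previous call was allowed

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `wait()` on a shared `Throttle` instance before each request (for example at the top of the loop over `artist.releases`) would keep the client at or below the server's limit. It does not make the run faster; it only avoids sending requests the server would delay anyway.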

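The bottleneck appears to be on the (Discogs) server side, in retrieving the individual releases. Short of giving them money to buy faster servers, there is not much you can actually do.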

My suggestion is to cache the results, which is probably the only thing that will help. Override discogs.APIBase._response as follows:

def _response(self):
    # Nothing in memory yet: try the on-disk cache first.
    if not self._cached_response:
        self._cached_response = self._load_response_from_disk()
    # Still nothing (missing or too old on disk): ask the server and persist the result.
    if not self._cached_response:
        if not self._check_user_agent():
            raise UserAgentError("Invalid or no User-Agent set.")
        self._cached_response = requests.get(self._uri, params=self._params, headers=self._headers)
        self._save_response_to_disk()

    return self._cached_response

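Another approach is to write the request to a log and answer "we don't know yet, try again later"; then, in another process, read the log, download the data, and store it in a database. When the user comes back later, the requested data will be there, ready.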
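The deferred approach can be sketched as follows. This is an in-process illustration only: `pending` stands in for the request log and `store` for the database, whereas a real setup would use persistent storage shared between the web process and the worker process. All names here (`lookup`, `drain`, `fetch`) are hypothetical, and the code targets Python 3 (in Python 2 the module is called `Queue`).

```python
import queue

pending = queue.Queue()   # stands in for the log of requested artist names
store = {}                # stands in for the database of downloaded results


def lookup(name):
    """Return stored data, or queue the request and report that it is not ready."""
    if name in store:
        return store[name]
    pending.put(name)
    return None  # the caller shows "we don't know yet, try again later"


def drain(fetch):
    """Worker step: download everything that was requested and store it.

    `fetch` is whatever function performs the slow API call.
    """
    while True:
        try:
            name = pending.get_nowait()
        except queue.Empty:
            break
        if name not in store:  # skip names that were queued more than once
            store[name] = fetch(name)
```

The web request path only ever calls `lookup`, so it returns immediately; the slow downloads happen in `drain`, which the worker runs on its own schedule.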

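You will need to write _load_response_from_disk() and _save_response_to_disk() yourself. The stored data should use _uri, _params, and _headers as the key and should include a timestamp for the data. If the data is too old, or is not found, return None. (For the maximum age I would suggest something on the order of months here; I don't know whether the numbering is persistent, so I would start with days to weeks.) The storage must handle concurrent access and fast indexing, so it should probably be a database.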
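One possible shape for that storage is sketched below with SQLite, as an assumption rather than the answer's own implementation. The function names, the table layout, and the two-week default are all hypothetical, and the sketch stores only the response body as text, so the two `APIBase` methods would still need to wrap these helpers and rebuild whatever object `_response` is expected to return.

```python
import json
import sqlite3
import time

MAX_AGE_SECONDS = 14 * 24 * 3600  # assumed starting point: two weeks


def open_cache(path):
    """Open (or create) the cache database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS responses ("
        "key TEXT PRIMARY KEY, fetched_at REAL, body TEXT)"
    )
    return db


def make_key(uri, params, headers):
    """Build a stable key from the three request attributes."""
    return json.dumps([uri, params, headers], sort_keys=True)


def save_response(db, uri, params, headers, body, now=None):
    """Store the response body together with the time it was fetched."""
    now = time.time() if now is None else now
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (make_key(uri, params, headers), now, body),
        )


def load_response(db, uri, params, headers, max_age=MAX_AGE_SECONDS, now=None):
    """Return the cached body, or None if it is missing or too old."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT fetched_at, body FROM responses WHERE key = ?",
        (make_key(uri, params, headers),),
    ).fetchone()
    if row is None or now - row[0] > max_age:
        return None
    return row[1]
```

The primary key on `key` gives the indexed lookup, and SQLite's own locking covers concurrent readers and occasional writers; a busier site might prefer a server database.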

Answered 2013-07-03T21:11:04.477

Try this: instead of using print >> to write to the file, use f.write('hello\n').
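Applied to the loop from the question, that suggestion might look like the sketch below. `dump_releases` and `SEPARATOR` are hypothetical names; `out` is any object with a `write` method, such as the opened file, and `releases` is any iterable of values to dump.

```python
SEPARATOR = '-' * 45


def dump_releases(out, releases):
    """Write each release followed by a separator line, using out.write."""
    for data in releases:
        out.write('%s\n' % (data,))
        out.write(SEPARATOR + '\n')
```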

Answered 2013-07-03T20:56:14.773