python - How to improve performance when loading large data from a database and converting it to JSON

Question

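I wrote a Django application that processes financial data. I have to load a large amount of data (more than 1,000,000 records) from a MySQL table and convert the records to JSON data in a Django view, like this: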

trades = MtgoxTrade.objects.all()
data = []
for trade in trades:
    js = dict()
    js['time'] = trade.time
    js['price'] = trade.price
    js['amount'] = trade.amount
    js['type'] = trade.type
    data.append(js)
return data

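The problem is that the for loop is very slow (more than 9 seconds for 200,000 records). Is there an efficient way to convert DB records to JSON-format data in Python?

Update: I ran the code from Mike Housky's answer in my environment (ActivePython 2.7, Win7), with create_data changed to: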

def create_data(n):
    from api.models import MtgoxTrade
    result = MtgoxTrade.objects.all()

    return result


  Build ............ 0.330999851227
  For loop ......... 7.98400020599
  List Comp. ....... 0.457000017166
  Ratio ............ 0.0572394796312
  For loop 2 ....... 0.381999969482
  Ratio ............ 0.047845686326

You can see that the for loop takes about 8 seconds! If I comment out the for loop, the list comprehension takes about that long instead:

Times:
  Build ............ 0.343000173569
  List Comp. ....... 7.57099986076
  For loop 2 ....... 0.375999927521

My new question: does the for loop hit the database? I don't see any database access in the logs. Very strange!

score 3 · Accepted Answer

Here are a few tips and things to try.

Since you ultimately need to create a JSON string from a queryset, use Django's built-in serializers:

from django.core import serializers

data = serializers.serialize("json", 
                             MtgoxTrade.objects.all(), 
                             fields=('time','price','amount','type'))

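You can use the ujson or simplejson module to make serialization faster. See the SERIALIZATION_MODULES setting.

Also, instead of fetching every field value from each record, be explicit and fetch only what you need to serialize: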

MtgoxTrade.objects.all().values('time','price','amount','type')

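You may also want to use the queryset's iterator() method:

...For a QuerySet which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory...

You can also split the huge queryset into batches; see: Batch querysets.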
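
The iterator() and batching advice can be combined with incremental JSON output, so that neither the full list of dicts nor the full JSON string has to be built in one step. The following is a minimal sketch, not a Django API: `iter_json_array` and `batch_size` are hypothetical names, and the rows are plain dicts standing in for what values() would yield.

```python
import json
from itertools import islice


def iter_json_array(rows, batch_size=1000):
    """Yield pieces of one JSON array, encoding batch_size rows at a time.

    rows can be any iterable of JSON-serializable dicts (a hypothetical
    stand-in for a values() queryset consumed through iterator()).
    """
    rows = iter(rows)
    yield "["
    first = True
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        # Encode the batch as a list, then strip its own brackets so the
        # pieces concatenate into a single array.
        body = json.dumps(batch)[1:-1]
        yield body if first else "," + body
        first = False
    yield "]"


# Stand-in data; in the question these rows would come from the database.
sample = ({"time": i, "price": 100 + i, "amount": 1, "type": "bid"}
          for i in range(2500))
text = "".join(iter_json_array(sample, batch_size=1000))
```

In a Django view, a generator like this could probably be handed to StreamingHttpResponse instead of being joined into one string, which keeps peak memory closer to one batch.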


score 2 · Accepted Answer

You can use a list comprehension, since it avoids many dict() and append() calls:

trades = MtgoxTrade.objects.all()
data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
        for trade in trades]
return data

Function calls are expensive in Python, so you should avoid them in slow loops.

score 1 · Accepted Answer

This answer is in support of Simeon Visser's observation. I ran the following code:

import gc, random, time
if "xrange" not in dir(__builtins__):
    xrange = range

class DataObject(object):
    def __init__(self, time, price, amount, type):
        self.time = time
        self.price = price
        self.amount = amount
        self.type = type

def create_data(n):
    result = []
    for index in xrange(n):
        s = str(index)
        result.append(DataObject("T"+s, "P"+s, "A"+s, "ty"+s))
    return result

def convert1(trades):
    data = []
    for trade in trades:
        js = dict()
        js['time'] = trade.time
        js['price'] = trade.price
        js['amount'] = trade.amount
        js['type'] = trade.type
        data.append(js)
    return data

def convert2(trades):
    data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
        for trade in trades]
    return data

def convert3(trades):
    ndata = len(trades)
    data = ndata*[None]
    for index in xrange(ndata):
        t = trades[index]
        js = dict()
        js['time']= t.time
        js['price']= t.price
        js['amount']= t.amount
        js['type']= t.type
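        #js = {"time" : t.time, "price" : t.price, "amount" : t.amount, "type" : t.type}
        # NOTE: js is never stored (there is no data[index] = js), so data stays
        # all None; this is presumably the bug acknowledged in the text below.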
    return data

def main(n=1000000):

    t0s = time.time()
    trades = create_data(n)
    t0f = time.time()
    t0 = t0f - t0s

    gc.disable()

    t1s = time.time()
    jtrades1 = convert1(trades)
    t1f = time.time()
    t1 = t1f - t1s

    t2s = time.time()
    jtrades2 = convert2(trades)
    t2f = time.time()
    t2 = t2f - t2s

    t3s = time.time()
    jtrades3 = convert3(trades)
    t3f = time.time()
    t3 = t3f - t3s

    gc.enable()

    print ("Times:")
    print ("  Build ............ " + str(t0))
    print ("  For loop ......... " + str(t1))
    print ("  List Comp. ....... " + str(t2))
    print ("  Ratio ............ " + str(t2/t1))
    print ("  For loop 2 ....... " + str(t3))
    print ("  Ratio ............ " + str(t3/t1))

main()

Results on Win7, Core 2 Duo 3.0GHz: Python 2.7.3:

Times:
  Build ............ 2.95600008965
  For loop ......... 0.699999809265
  List Comp. ....... 0.512000083923
  Ratio ............ 0.731428890618
  For loop 2 ....... 0.609999895096
  Ratio ............ 0.871428659011

Python 3.3.0:

Times:
  Build ............ 3.4320058822631836
  For loop ......... 1.0200011730194092
  List Comp. ....... 0.7500009536743164
  Ratio ............ 0.7352942070195492
  For loop 2 ....... 0.9500019550323486
  Ratio ............ 0.9313733946208623

Those vary a bit, even with GC disabled (much more variance with GC enabled, but about the same results). The third conversion timing shows that a fair-sized chunk of the saved time comes from not calling .append() a million times.

Ignore the "For loop 2" times. This version has a bug and I am out of time to fix it for now.
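
The bug in convert3 appears to be that js is built but never stored in data, so the function returns a list of None. A corrected sketch follows; `convert3_fixed` is a name introduced here, `Trade` is a stand-in for the DataObject class, and no timings are claimed for it.

```python
from collections import namedtuple

# Stand-in for the DataObject class used in the benchmark.
Trade = namedtuple("Trade", ["time", "price", "amount", "type"])


def convert3_fixed(trades):
    """Preallocate the result list and fill it by index (no append calls)."""
    ndata = len(trades)
    data = ndata * [None]
    for index in range(ndata):
        t = trades[index]
        # Store the dict in its slot; this assignment is what convert3 lacks.
        data[index] = {"time": t.time, "price": t.price,
                       "amount": t.amount, "type": t.type}
    return data


trades = [Trade("T%d" % i, "P%d" % i, "A%d" % i, "ty%d" % i) for i in range(3)]
result = convert3_fixed(trades)
```

With the assignment in place the function does the same work as the other two conversions, so its timing would need to be measured again before drawing conclusions about .append().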

score 0 · Accepted Answer

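I think it is worth trying a raw query against the database, because a model puts a lot of extra boilerplate code into its fields (I believe fields are properties), and, like the function calls mentioned earlier, that is expensive. See the documentation; at the bottom of the page there is an example using dictfetchall, which seems to be what you are after.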
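
The dictfetchall helper shown in the Django documentation works on any DB-API cursor, so the idea can be tried without Django. The sketch below uses an in-memory sqlite3 table as a stand-in for the MySQL table; the table name `trade`, the extra `note` column, and the rows are made up for illustration.

```python
import sqlite3


def dictfetchall(cursor):
    """Return all rows from a DB-API cursor as a list of dicts."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade (time INTEGER, price REAL, amount REAL, type TEXT, note TEXT)"
)
conn.executemany(
    "INSERT INTO trade VALUES (?, ?, ?, ?, ?)",
    [(1, 10.5, 2.0, "bid", "x"), (2, 10.6, 1.0, "ask", "y")],
)

# Select only the columns that will be serialized.
cursor = conn.execute("SELECT time, price, amount, type FROM trade ORDER BY time")
rows = dictfetchall(cursor)
conn.close()
```

In a Django project the cursor would come from django.db.connection.cursor() rather than sqlite3, and the SQL would name the model's actual table.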

score 0 · Accepted Answer

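First, you have to check whether the performance loss happens while fetching the data from the database or inside the loop.

There is no real option that gives you a significant speedup, not even the list comprehension mentioned above.

However, there is a huge performance difference between Python 2 and 3.

A simple benchmark shows me that the for loop is roughly 2.5 times faster with Python 3.3 (using a simple benchmark like the following):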
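
One way to separate the two costs is to force the fetch before timing the conversion. Django querysets are lazy: the SQL runs, and the model instances are built, the first time the queryset is iterated, and the results are then cached on that queryset object. That would explain the numbers in the question, where whichever conversion runs first takes about 8 seconds. The sketch below imitates this behaviour with a hypothetical `LazyRows` class (not a Django API) so that it runs without a database.

```python
import time


class LazyRows(object):
    """Hypothetical stand-in for a lazy queryset: loads once, then caches."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.loads = 0  # how many times the expensive load ran

    def __iter__(self):
        if self._cache is None:
            self._cache = list(self._loader())
            self.loads += 1
        return iter(self._cache)


def slow_loader():
    # Pretend each row costs something to fetch and instantiate.
    for i in range(1000):
        yield {"time": i, "price": i * 2}


rows = LazyRows(slow_loader)

start = time.time()
fetched = list(rows)  # forces the "query"; time this step on its own
fetch_seconds = time.time() - start

start = time.time()
data = [{"time": r["time"], "price": r["price"]} for r in rows]  # served from cache
convert_seconds = time.time() - start
```

With the real model, calling list(MtgoxTrade.objects.all()) before the timed section would play the role of list(rows) here, so the conversion loop is measured without the fetch.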

import time

ts = time.time()
data = list()
for i in range(1000000):
    d = {}
    d['a'] = 1
    d['b'] = 2
    d['c'] = 3
    d['d'] = 4
    d['a'] = 5
    data.append(d)

print(time.time() - ts)



/opt/python-3.3.0/bin/python3 foo2.py 
0.5906929969787598

python2.6 foo2.py 
1.74390792847

python2.7 foo2.py 
0.673550128937

You will also notice a significant performance difference between Python 2.6 and 2.7.

score 0 · Accepted Answer

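You may want to look at the values method. It returns an iterable of dicts instead of model objects, so you don't have to create a lot of intermediate data structures. Your code can be reduced to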

return MtgoxTrade.objects.values('time', 'price', 'amount', 'type')
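
Note that values() returns a lazy queryset rather than a list, so it generally has to be wrapped in list() before being passed to json.dumps. Financial fields are also often datetime and Decimal values, which the json module does not encode by default. The sketch below assumes those field types (the question does not state them) and uses plain dicts in place of the values() rows; `encode_extra` is a hypothetical helper name.

```python
import datetime
import json
from decimal import Decimal


def encode_extra(value):
    """json.dumps fallback for types the default encoder does not handle."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        # Emit as a string to avoid float rounding of prices.
        return str(value)
    raise TypeError("Cannot serialize %r" % (value,))


# Plain dicts standing in for list(MtgoxTrade.objects.values(...)).
rows = [
    {"time": datetime.datetime(2013, 4, 1, 12, 0, 0), "price": Decimal("93.25"),
     "amount": Decimal("0.5"), "type": "bid"},
]
payload = json.dumps(rows, default=encode_extra)
```

Whether prices should go out as strings or as numbers depends on what the client expects; strings are used here only to keep the decimal digits exact.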
