python - Fastest way to create JSON to reflect a tree structure in Python / Django using mptt

Question

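What is the fastest way in Python (Django) to create JSON based on a Django queryset? Note that parsing it in the template, as proposed here, is not an option.

The background is that I created a method which loops over all nodes in a tree, but it is already terribly slow when converting about 300 nodes. The first (and probably the worst) idea that came to my mind is to create the JSON somehow "manually". See the code below.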

#! Solution 1 !!#
# Python 2 / Django 1.x encoding helpers
from django.utils.encoding import smart_str, smart_unicode

def quoteStr(input):
    return "\"" + smart_str(smart_unicode(input)) + "\""

def createJSONTreeDump(user, node, root=False, lastChild=False):
    # `indent` was not defined in the posted snippet; an empty string is assumed here
    indent = ""

    # open tag for object
    json = str("\n" + indent + "{" +
               quoteStr("name") + ": " + quoteStr(node.name) + ",\n" +
               quoteStr("id") + ": " + quoteStr(node.pk))

    childrenTag = "children"
    children = node.get_children()
    if children.count() > 0:
        # create children array opening tag (comma only when children follow)
        json += str(",\n" + indent + quoteStr(childrenTag) + ": [")
        for idx, child in enumerate(children):
            if (idx + 1) == children.count():
                # recursive call, last child
                json += createJSONTreeDump(user, child, False, True)
            else:
                # recursive call, more siblings follow
                json += createJSONTreeDump(user, child, False, False)
        # add children closing tag
        json += "]"
    json += "\n"

    # closing tag for object
    if lastChild == False:
        # more siblings following, add ","
        json += indent + "},\n"
    else:
        # last child, do not add ","
        json += indent + "}\n"
    return json

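The tree structure to be rendered is a tree built with mptt, where calling .get_children() returns all children of a node.

The model looks as simple as this, with mptt taking care of everything else.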

class Node(MPTTModel, ExtraManager):
    """
    Representation of a single node
    """ 
    name = models.CharField(max_length=200)
    parent = TreeForeignKey('self', null=True, blank=True, related_name='%(app_label)s_%(class)s_children')

The expected JSON result is created in the template like this: var root = {{ jsonTree|safe }}

Edit: Based on this answer I created the following code (definitely the better code), but it feels only slightly faster.

Solution 2:

def serializable_object(node):
    "Recurse into tree to build a serializable object"
    obj = {'name': node.name, 'id': node.pk, 'children': []}
    for child in node.get_children():
        obj['children'].append(serializable_object(child))
    return obj

import json
jsonTree = json.dumps(serializable_object(nodeInstance))

Solution 3:

def serializable_object_List_Comprehension(node):
    "Recurse into tree to build a serializable object"
    obj = {
        'name': node.name,
        'id': node.pk,
        'children': [serializable_object_List_Comprehension(ch) for ch in node.get_children()]
    }
    return obj

Solution 4:

def recursive_node_to_dict(node):
    result = {
        'name': node.name, 'id': node.pk
    }
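    # Note: the trailing comma makes `children` a one-element tuple, which is
    # what produces the nested "children": [[...]] mentioned in the discussion
    children = [recursive_node_to_dict(c) for c in node.get_children()],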
    if children is not None:
        result['children'] = children
    return result

from mptt.templatetags.mptt_tags import cache_tree_children
root_nodes = cache_tree_children(root.get_descendants())
dicts = []
for n in root_nodes:
    dicts.append(recursive_node_to_dict(root_nodes[0]))
    jsonTree = json.dumps(dicts, indent=4)

Solution 5 (using select_related to prefetch, though not sure it is used correctly):

def serializable_object_select_related(node):
    "Recurse into tree to build a serializable object, make use of select_related"
    obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(), 'id': node.pk, 'level': node.level, 'position': node.position, 'children': []}
    for child in node.get_children().select_related():
        obj['children'].append(serializable_object_select_related(child))
    return obj

Solution 6 (improved Solution 4, using the cache of child nodes):

def recursive_node_to_dict(node):
    return {
        'name': node.name, 'id': node.pk,
         # Notice the use of node._cached_children instead of node.get_children()
        'children' : [recursive_node_to_dict(c) for c in node._cached_children]
    }

Called via:

from mptt.templatetags.mptt_tags import cache_tree_children
subTrees = cache_tree_children(root.get_descendants(include_self=True))
subTreeDicts = []
for subTree in subTrees:
    subTree = recursive_node_to_dict(subTree)
    subTreeDicts.append(subTree)
jsonTree = json.dumps(subTreeDicts, indent=4)
# optional clean up: remove the [ ] at the beginning and the end, as needed for D3.js
jsonTree = jsonTree[1:-1]
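
If the brackets are removed only because there is a single root, an alternative to slicing the string is to serialize that one dictionary directly. A small sketch, using a hand-written stand-in for subTreeDicts (the sample data is made up for illustration):

```python
import json

# Stand-in for the list built by recursive_node_to_dict(); data is illustrative.
subTreeDicts = [
    {"name": "root", "id": 1, "children": [
        {"name": "child", "id": 2, "children": []},
    ]},
]

# String slicing, as in the snippet above.
sliced = json.dumps(subTreeDicts, indent=4)[1:-1]

# Dumping the single root directly yields the same JSON value.
direct = json.dumps(subTreeDicts[0], indent=4)

print(json.loads(sliced) == json.loads(direct))  # True
```

This only applies when the list holds exactly one root; with several roots the sliced string would no longer be a single JSON value.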

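You can see the profiling results below, created using cProfile as suggested by MuMind, by setting up a Django view that starts the stand-alone method profileJSON(), which in turn calls the different solutions to create the JSON output.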

def startProfileJSON(request):
    print("startProfileJSON")
    import cProfile
    cProfile.runctx('profileJSON()', globals=globals(), locals=locals())
    print("endProfileJSON")

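Results:

Solution 1: 3350347 function calls (3130372 primitive calls) in 4.969 seconds (details)

Solution 2: 2533705 function calls (2354516 primitive calls) in 3.630 seconds (details)

Solution 3: 2533621 function calls (2354441 primitive calls) in 3.684 seconds (details)

Solution 4: 2812725 function calls (2466028 primitive calls) in 3.840 seconds (details)

Solution 5: 2536504 function calls (2357256 primitive calls) in 3.779 seconds (details)

Solution 6 (improved Solution 4): 2593122 function calls (2299165 primitive calls) in 3.663 seconds (details)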

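Discussion:

Solution 1: own encoding implementation. Bad idea.

Solution 2 + 3: currently the fastest, but still painfully slow.

Solution 4: caching the children looks promising, but it performs similarly and currently produces invalid JSON, because the children are put into double []: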

"children": [[]] instead of "children": []
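
The double brackets appear to come from the trailing comma after the list comprehension in Solution 4, which turns children into a one-element tuple (and also makes the "is not None" check always true). A minimal sketch with plain Python values, no Django or mptt involved, that reproduces the effect:

```python
import json

# Trailing comma: this is a 1-tuple whose only element is the (empty) list.
children = [],
print(type(children).__name__)                   # tuple
print(json.dumps({"children": children}))        # {"children": [[]]}

# Without the comma it is just the list.
children_fixed = []
print(json.dumps({"children": children_fixed}))  # {"children": []}
```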

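Solution 5: using select_related does not make a difference, although it is probably used the wrong way, since a node always has a ForeignKey to its parent and we are parsing from the root down to the children.

Update: Solution 6: it looks like the cleanest solution to me, using the cache of child nodes. But it only performs similarly to Solution 2 + 3, which is strange to me.

Does anybody have more ideas for performance improvements?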

score 29 · Accepted Answer

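I suspect that by far the biggest slowdown is that this will do 1 database query per node. The JSON rendering is trivial in comparison to the hundreds of round-trips to your database.

You should cache the children on each node so that those queries can be done all at once. django-mptt has a cache_tree_children() function you can do this with.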

import json
from mptt.templatetags.mptt_tags import cache_tree_children

def recursive_node_to_dict(node):
    result = {
        'id': node.pk,
        'name': node.name,
    }
    children = [recursive_node_to_dict(c) for c in node.get_children()]
    if children:
        result['children'] = children
    return result

root_nodes = cache_tree_children(Node.objects.all())
dicts = []
for n in root_nodes:
    dicts.append(recursive_node_to_dict(n))

print(json.dumps(dicts, indent=4))

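Custom JSON encoding, while it might provide a slight speedup in some scenarios, is something I'd highly discourage, as it will be a lot of code and it's easy to get very wrong.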
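
The round-trip argument at the start of this answer can be illustrated with a toy model. The sketch below uses no Django at all: CountingStore is a made-up stand-in for a database that merely counts how often it is asked for data, and the rows are invented. It compares asking for children once per node with fetching all rows once and grouping them in memory.

```python
from collections import defaultdict

# Flat rows as a database might return them: (pk, parent_pk, name). Made-up data.
ROWS = [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1")]

class CountingStore:
    """Toy stand-in for a database that counts round-trips."""
    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def children_of(self, pk):
        self.queries += 1          # one round-trip per call
        return [r for r in self.rows if r[1] == pk]

    def all_rows(self):
        self.queries += 1          # a single round-trip
        return list(self.rows)

def per_node(store, row):
    """Recursive build that queries the store once for every node."""
    pk, _, name = row
    return {"id": pk, "name": name,
            "children": [per_node(store, c) for c in store.children_of(pk)]}

def prefetched(store):
    """Fetch everything once, then group children by parent in memory."""
    by_parent = defaultdict(list)
    for row in store.all_rows():
        by_parent[row[1]].append(row)

    def build(row):
        pk, _, name = row
        return {"id": pk, "name": name,
                "children": [build(c) for c in by_parent[pk]]}

    return [build(r) for r in by_parent[None]]

slow, fast = CountingStore(ROWS), CountingStore(ROWS)
tree_slow = [per_node(slow, ROWS[0])]
tree_fast = prefetched(fast)
print(slow.queries, fast.queries)  # 4 1
print(tree_slow == tree_fast)      # True
```

Both builds produce the same tree; only the number of simulated round-trips differs, and it grows with the node count in the per-node version.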

score 8 · Accepted Answer

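Your updated version looks like it would have very little overhead. I think it would be slightly more efficient (and more readable, too!) to use a list comprehension: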

def serializable_object(node):
    "Recurse into tree to build a serializable object"
    obj = {
        'name': node.name,
        'children': [serializable_object(ch) for ch in node.get_children()]
    }
    return obj

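Besides that, all you can do is profile it to find the bottlenecks. Write some standalone code that loads and serializes your 300 nodes and then run it with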

python -m profile serialize_benchmark.py

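(or -m cProfile if that works better).

I can see 3 different potential bottlenecks:

1. DB access (.get_children() and .name) -- I'm not sure exactly what's going on under the hood, but I've had code like this that does a DB query for each node, adding tremendous overhead. If that's your problem, you can probably configure it to do "eager loading" using select_related or something similar.
2. Function call overhead (e.g. serializable_object itself) -- just make sure ncalls for serializable_object looks like a reasonable number. If I understand your description, it should be in the neighborhood of 300.
3. Serializing at the end (json.dumps(nodeInstance)) -- probably not the culprit since you said it's only 300 nodes, but if you do see this taking up a lot of execution time, make sure your compiled speedups for JSON are working properly.

If you can't tell much from profiling it, make a stripped-down version that, say, recursively calls node.name and node.get_children() but doesn't store the results in a data structure, and see how that compares.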
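
To check the ncalls figure from a script rather than from the command line, something along these lines might help. It is only a sketch: the tree is built from plain dicts by a made-up make_tree() helper so that it runs without a database, and serializable_object here works on those dicts rather than on mptt nodes.

```python
import cProfile
import io
import pstats

def make_tree(depth, fanout):
    """Build a nested dict tree as stand-in data (no database involved)."""
    if depth == 0:
        return {"name": "leaf", "children": []}
    return {"name": "node",
            "children": [make_tree(depth - 1, fanout) for _ in range(fanout)]}

def serializable_object(node):
    """Same shape as the function above, but reading from plain dicts."""
    return {"name": node["name"],
            "children": [serializable_object(ch) for ch in node["children"]]}

tree = make_tree(depth=4, fanout=3)   # 1 + 3 + 9 + 27 + 81 = 121 nodes

profiler = cProfile.Profile()
profiler.enable()
serializable_object(tree)
profiler.disable()

# Print only the rows whose function name matches, sorted by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats("serializable_object")
print(out.getvalue())
```

The ncalls column for serializable_object should match the number of nodes; a much larger number would suggest the function is being invoked more often than intended.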

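UPDATE: There are 2192 calls to execute_sql in solution 3 and 2192 in solution 5, so I think excessive DB queries are a problem and that select_related did nothing the way it's used above. Looking at django-mptt issue #88: Allow select_related in model methods suggests that you're using it more or less right, but I have my doubts, and get_children vs. get_descendants might make a huge difference.

There's also a lot of time being taken up by copy.deepcopy, which is puzzling because you're not calling it directly, and I don't see it called from the MPTT code. What's tree.py?

If you're doing a lot of work with profiling, I'd highly recommend the really slick tool RunSnakeRun, which lets you see your profile data in a very convenient grid form and make sense of the data more quickly.

Anyway, here's one more attempt at streamlining the DB side of things: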

def serializable_object(node):
    # pk -> dict cache. A plain dict is used because built-in dict objects
    # cannot be weakly referenced, so a WeakValueDictionary cannot hold them.
    obj_cache = {}
    root_obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(),
            'id': node.pk, 'level': node.level, 'position': node.position,
            'children': []}
    obj_cache[node.pk] = root_obj
    # don't know if the following .select_related() does anything...
    for descendant in node.get_descendants().select_related():
        # get_descendants supposedly traverses in "tree order", which I think
        # means the parent obj will always be created already
        parent_obj = obj_cache[descendant.parent.pk]    # hope parent is cached
        descendant_obj = {'name': descendant.get_wbs_code(),
            'wbsCode': descendant.get_wbs_code(), 'id': descendant.pk,
            'level': descendant.level, 'position': descendant.position,
            'children': []}
        parent_obj['children'].append(descendant_obj)
        obj_cache[descendant.pk] = descendant_obj
    return root_obj

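Note that this is no longer recursive. It proceeds iteratively through the nodes, theoretically after their parents have been visited, and it all uses one big call to MPTTModel.get_descendants(), so hopefully that is well optimized and caches .parent etc. (or maybe there's a more direct way to get at the parent key?). It creates each obj with no children initially, then "grafts" all the values onto their parents afterwards.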
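
On the question of a more direct way to get the parent key: Django exposes the raw foreign key column as an attribute named after the field plus _id, so descendant.parent_id should be readable without loading the parent object. Below is a sketch of the same grafting loop over plain stand-in records; the Row tuple and the sample data are assumptions that stand in for model instances returned in tree order.

```python
from collections import namedtuple

# Hypothetical stand-in for model instances; tree order means parents come first.
Row = namedtuple("Row", "pk parent_id name")
rows = [Row(1, None, "root"), Row(2, 1, "a"), Row(3, 2, "a1"), Row(4, 1, "b")]

def graft(rows):
    by_pk = {}      # pk -> dict already built for that node
    roots = []
    for row in rows:
        obj = {"id": row.pk, "name": row.name, "children": []}
        by_pk[row.pk] = obj
        if row.parent_id in by_pk:
            # Parent was seen earlier: attach this node to it.
            by_pk[row.parent_id]["children"].append(obj)
        else:
            # No known parent: treat this node as a root.
            roots.append(obj)
    return roots

tree = graft(rows)
print([c["name"] for c in tree[0]["children"]])  # ['a', 'b']
```

The loop relies on every parent appearing before its children; if the rows arrive in another order, a node whose parent has not been seen yet ends up in the root list.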

score 0 · Accepted Answer

Organize your data into nested dictionaries or lists, then call the json dumps method:

import json
data = ['foo', {'bar': ('baz', None, 1.0, 2)}]
json.dumps(data)  # json.dump() would additionally need a file object

score 0 · Accepted Answer

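After playing around with this for a while, I found that the solutions were all too slow, because mptt itself scans the cache multiple times in get_children.

Capitalizing on the fact that mptt returns rows in the right order to easily build a tree, I made this: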

def flat_tree_to_dict(nodes, max_depth):
    tree = []
    last_levels = [None] * max_depth
    for n in nodes:
        d = {'name': n.name}
        if n.level == 0:
            tree.append(d)
        else:
            parent_dict = last_levels[n.level - 1]
            if 'children' not in parent_dict:
                parent_dict['children'] = []
            parent_dict['children'].append(d)
        last_levels[n.level] = d
    return tree

For my dataset it runs about 10x faster than the other solutions, because it is O(n) and only iterates over the data once.

I use it like this:

json.dumps(flat_tree_to_dict(Model.objects.all(), 4), indent=4)
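
For trying the idea out without a database, here is a variant sketch. It is not the function above: it trims a running path list instead of preallocating max_depth slots, and the N tuple and its sample data are made-up stand-ins for mptt nodes, of which only name and level are used. It assumes the nodes arrive in tree order, as in the answer.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for mptt nodes: only `name` and `level` are used.
N = namedtuple("N", "name level")
nodes = [N("root", 0), N("a", 1), N("a1", 2), N("b", 1), N("root2", 0)]

def flat_tree_to_dict(nodes):
    tree = []
    path = []                   # path[i] is the latest dict seen at level i
    for n in nodes:
        d = {"name": n.name}
        del path[n.level:]      # drop entries at this level or deeper
        if path:
            # The last remaining entry is this node's parent.
            path[-1].setdefault("children", []).append(d)
        else:
            tree.append(d)      # level 0: a new root
        path.append(d)
    return tree

print(json.dumps(flat_tree_to_dict(nodes)))
```

Like the original, this visits each node once, so the work stays linear in the number of nodes.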
