python - Efficiently importing Cypher statements

Question

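I'm trying to export a database to a file and import it again without copying the actual database files or stopping the database. I realise that there are a number of excellent (and performant) neo4j-shell-tools, but the Neo4j database is remote and the export-* and import-* commands require that the files reside on the remote client, whereas in my situation they reside locally.

The following post explains alternative methods for exporting/importing data, but the import is not particularly performant.

The following examples use a subset of our data store comprising 10,000 nodes with various labels/properties, for testing purposes. First, the database was exported via,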

> time cypher-shell 'CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {batchSize: 1000, format: "cypher-shell", separateFiles: true})'  
real    0m1.703s

then wiped,

neo4j stop
rm -rf /var/log/neo4j/data/databases/graph.db
neo4j start

before re-importing,

time cypher-shell < /tmp/graph.db.nodes.cypher
real    0m39.105s

This doesn't seem particularly performant. I also tried the Python route, exporting the Cypher in plain format:

CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {format: "plain", separateFiles: true})

The following snippet runs in ~ 30 seconds (using a batch size of 1,000),

from itertools import izip_longest  # Python 2; itertools.zip_longest on Python 3
from neo4j.v1 import GraphDatabase


with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            # Read the file in chunks of 1,000 lines, one transaction per chunk.
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:  # the last chunk is padded with None
                            tx.run(line)
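
The chunking idiom `izip_longest(*[file] * 1000)` passes the same file iterator 1,000 times, so each output tuple consumes 1,000 consecutive lines and the final tuple is padded with `None`. A minimal Python 3 sketch of the same idea, using `itertools.zip_longest` and an in-memory `io.StringIO` in place of the export file (the `chunks` helper name and the sample statements are illustrative, not part of the original code):

```python
import io
from itertools import zip_longest


def chunks(iterable, size):
    """Yield lists of up to `size` consecutive items from `iterable`."""
    iterator = iter(iterable)
    # The same iterator repeated `size` times: each tuple pulls `size` items in turn.
    for group in zip_longest(*[iterator] * size):
        # zip_longest pads the final tuple with None; drop the padding.
        yield [item for item in group if item is not None]


# Hypothetical stand-in for the exported file: five one-line statements.
sample = io.StringIO("".join("CREATE (:`N` {{`id`:{}}});\n".format(i) for i in range(5)))
batches = list(chunks(sample, 2))
sizes = [len(batch) for batch in batches]  # three batches: 2, 2 and 1 lines
```

Stripping the padding inside the helper removes the need for the `if line:` check in the import loop.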

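I realise that parameterised Cypher queries are better optimised, so I used some rather kludgy logic (note that string replacement will not work in all cases) to try to extract the labels and properties from the Cypher code (this executes in ~ 8 seconds):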

from itertools import izip_longest
import json
from neo4j.v1 import GraphDatabase
import re


def decode(statement):
    # Split "CREATE (<labels> <properties>);" into its two parts.
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    properties = json.loads(m.group(2).replace('`', '"')) # kludgy
    return labels, properties


with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            labels, properties = decode(line)

                            # Run only for real lines, not the None padding.
                            tx.run(
                                'CALL apoc.create.node({labels}, {properties})',
                                labels=labels,
                                properties=properties,
                            )

Using UNWIND rather than explicit transactions further improves the performance to ~ 5 seconds:

# Reuses the imports and decode() from the previous snippet.
with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                rows = []

                for line in chunk:
                    if line:
                        labels, properties = decode(line)
                        rows.append({'labels': labels, 'properties': properties})

                # One statement per chunk; the whole chunk is a single parameter.
                session.run(
                    """
                    UNWIND {rows} AS row
                    WITH row.labels AS labels, row.properties AS properties
                    CALL apoc.create.node(labels, properties) YIELD node
                    RETURN true
                    """,
                    rows=rows,
                )
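
The UNWIND approach sends one statement per batch, with the whole batch passed as a single list parameter. The sketch below isolates that batching logic behind a `run` callable (for example `session.run`), so it can be exercised without a database; the `import_rows` name and the recording stub are hypothetical. It keeps the `{rows}` parameter syntax used in the question; newer Neo4j versions expect `$rows` instead.

```python
from itertools import islice

QUERY = """
UNWIND {rows} AS row
WITH row.labels AS labels, row.properties AS properties
CALL apoc.create.node(labels, properties) YIELD node
RETURN true
"""


def import_rows(rows, run, batch_size=1000):
    """Send `rows` to `run` in batches; return the number of batches sent."""
    iterator = iter(rows)
    sent = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        run(QUERY, rows=batch)  # one round trip per batch
        sent += 1
    return sent


# Stub standing in for session.run; it only records what would be sent.
calls = []


def record(query, **params):
    calls.append(params["rows"])


rows = [{"labels": ["Person"], "properties": {"id": i}} for i in range(2500)]
batch_count = import_rows(rows, record, batch_size=1000)
```

Because `import_rows` accepts any iterable, the rows could come from a generator that decodes the file lazily, so the full export never has to sit in memory.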

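Is this the right approach for speeding up a Cypher import? Ideally I'd like not to have to do this level of manipulation in Python, partly because it is potentially error-prone and I would have to do something similar for relationships.

Also, does anyone know the proper way to decode Cypher to extract the properties? This method fails if a property contains a backtick (`). Note that I don't want to go down the GraphML route, since I also need the schema, which comes with the Cypher format export. It does feel strange to be unwrapping the Cypher in this manner, though.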
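
The backtick problem is easy to reproduce. The sketch below reuses the `decode` logic from the question (written for Python 3) on two hypothetical one-line statements; the sample labels and properties are made up, and the real export may contain forms this regex does not handle.

```python
import json
import re


def decode(statement):
    """Split a one-line CREATE statement into labels and properties."""
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    # Backtick-quoted keys become JSON strings; this is the fragile step.
    properties = json.loads(m.group(2).replace('`', '"'))
    return labels, properties


ok = 'CREATE (:`Person`:`Employee` {`name`:"Alice", `age`:30});'
labels, properties = decode(ok)

# A backtick inside a value is also turned into a quote, which breaks the JSON.
bad = 'CREATE (:`Person` {`note`:"a`b"});'
try:
    decode(bad)
    failed = False
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    failed = True
```

The first statement decodes cleanly, while the second raises because the blanket replacement cannot tell a quoting backtick from one inside a value.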

Finally, for reference, the import-binary shell command takes ~ 3 seconds to perform the same import:

> neo4j-shell -c "import-binary -b 1000 -i /tmp/graph.db.bin"
...
finish after 10000 row(s)  10. 100%: nodes = 10000 rels = 0 properties = 106289 time 3 ms total 3221 ms
