performance - Can't reproduce/verify the performance claims in the Graph Databases and Neo4j in Action books

Question

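UPDATE: I have asked a follow-up question containing updated scripts and a clearer setup: "neo4j performance compared to mysql (how can it be improved?)". Please continue there. /UPDATE

I have some trouble verifying the performance claims made in the book "Graph Databases" (page 20) and in "Neo4j in Action" (chapter 1).

To verify these claims I created a sample dataset of 100000 "person" entries with 50 "friends" each, and tried to query for, e.g., friends 4 hops away. I used the very same dataset in mysql. For friends of friends over 4 hops, mysql returns in 0.93 secs, while neo4j needs 65-75 secs (on repeated calls).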
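
For a sense of scale, here is a rough count of how many paths such a query has to walk. This is only a sketch and assumes every person has exactly 50 outgoing friend relationships; the helper name `path_count` is made up for illustration. Note that both queries used further down count paths (`count(*)` / `count(r)`), not distinct people, so duplicates are not collapsed.

```python
def path_count(friends_per_person, hops):
    """Number of directed walks of exactly `hops` steps from one person,
    assuming every person has the same number of outgoing friends."""
    return friends_per_person ** hops


# Assumed dataset shape from the question: 50 friends per person.
for hops in (3, 4):
    print("%d hops: %d paths" % (hops, path_count(50, hops)))
```

Under that assumption, going from 3 to 4 hops multiplies the work by 50, which is roughly the gap between the 3-hop and 4-hop neo4j timings reported below.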

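How can I improve this miserable outcome, and verify the claims made in the books?

A bit more detail:

I run the whole setup on an i5-3570K with 16GB RAM, using Ubuntu 12.04 64-bit, Java version "1.7.0_25", MySQL 5.5.31 and neo4j-community-2.0.0-M03 (I get similar results with 1.9).

All code and sample data can be found at https://github.com/jhb/neo4j-exeriements/ (for use with 2.0.0). The resulting sample data in different formats can be found at https://github.com/jhb/neo4j-testdata.

To use the scripts you need Python with mysql-python, requests and simplejson installed.

the dataset is created with friendsdata.py and stored to friends.pickle
friends.pickle is imported into neo4j using import_friends_neo4j.py
friends.pickle is imported into mysql using import_friends_mysql.py
I add indexes on t_user_friend.* in mysql
I added "create index on :node(noscenda_name)" in neo4j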
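
The generation step could look roughly like the sketch below. This is not the code from friendsdata.py; the function names, the dict-of-lists layout and the fixed seed are assumptions made only to illustrate a dataset where each person gets the same number of distinct friends.

```python
import pickle
import random


def make_friends(n_people, n_friends, seed=0):
    """Return {person_name: [friend_name, ...]} with `n_friends` distinct
    friends per person, never including the person themselves."""
    rng = random.Random(seed)
    people = ["person%d" % i for i in range(n_people)]
    friends = {}
    for i, name in enumerate(people):
        # Sample indexes from a range one shorter than the population,
        # then shift indexes >= i by one so that person i is skipped.
        picks = rng.sample(range(n_people - 1), n_friends)
        friends[name] = [people[j if j < i else j + 1] for j in picks]
    return friends


def save_friends(friends, path):
    """Pickle the friends mapping to `path` (hypothetical file layout)."""
    with open(path, "wb") as handle:
        pickle.dump(friends, handle)


# Small demo; the question uses 100000 people with 50 friends each.
sample = make_friends(200, 50)
print(len(sample), len(sample["person0"]))
```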

To make life a bit easier, the friends.*.bz2 files contain sql and cypher statements to create those datasets in mysql and neo4j 2.0 M3.

MySQL performance

I first warm up mysql by querying:

select count(distinct name) from t_user;
select count(distinct name) from t_user;

Then, for the real measurement, I run

python query_friends_mysql.py 4 10

This creates the following sql statement (with changing t_user.name values):

select 
    count(*)
from
    t_user,
    t_user_friend as uf1, 
    t_user_friend as uf2, 
    t_user_friend as uf3, 
    t_user_friend as uf4
where
    t_user.name='person8601' and 
    t_user.id = uf1.user_1 and
    uf1.user_2 = uf2.user_1 and
    uf2.user_2 = uf3.user_1 and
    uf3.user_2 = uf4.user_1;

and repeats this 4-hop query 10 times. Each query takes around 0.95 secs. MySQL is configured to use a key_buffer of 4G.
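
A script like query_friends_mysql.py presumably builds that statement from the hop count. The sketch below shows one way to do it; it is not the actual script, and `build_hops_sql` and `timed` are hypothetical helpers. The person name is left as a `%s` placeholder, the parameter style mysql-python uses, so the value can be passed separately to `cursor.execute`.

```python
import time


def build_hops_sql(hops):
    """Build the n-hop self-join over t_user_friend shown above."""
    if hops < 1:
        raise ValueError("hops must be >= 1")
    aliases = ["uf%d" % i for i in range(1, hops + 1)]
    tables = ["t_user"] + ["t_user_friend as %s" % a for a in aliases]
    # The literal %s below is the DB-API placeholder for the person name.
    conds = ["t_user.name = %s", "t_user.id = uf1.user_1"]
    for prev, cur in zip(aliases, aliases[1:]):
        conds.append("%s.user_2 = %s.user_1" % (prev, cur))
    return "select count(*) from %s where %s" % (
        ", ".join(tables), " and ".join(conds))


def timed(fn, repeats):
    """Call `fn` `repeats` times and return the wall-clock durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


print(build_hops_sql(4))
```

With a live connection, something like `timed(lambda: cursor.execute(build_hops_sql(4), ("person8601",)), 10)` would give the per-query timings.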

neo4j performance test

I have modified neo4j.properties:

neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M

and neo4j-wrapper.conf:

wrapper.java.initmemory=2048
wrapper.java.maxmemory=8192

To warm up neo4j I run

start n=node(*) return count(n.noscenda_name);
start r=relationship(*) return count(r);

Then I start using the transactional http endpoint (but I get the same results using neo4j-shell).

Still warming up, I run

./bin/python query_friends_neo4j.py 3 10

This creates a query of the form (with varying person ids):

{"statement": "match n:node-[r*3..3]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}}

After roughly the 7th call, each call takes around 0.7-0.8 secs.

Now for the real thing (4 hops) I run

./bin/python query_friends_neo4j.py 4 10

creating

{"statement": "match n:node-[r*4..4]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}}

and each call takes between 65 and 75 secs.
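
For reference, the statement objects above can be produced from the hop count as sketched below. This is not the code from query_friends_neo4j.py; the helper names are made up. Wrapping the statement in an outer "statements" list is my understanding of what the transactional http endpoint expects as a request body, and is not shown in the question itself.

```python
import json


def build_statement(hops, target):
    """One statement object in the form shown above, for a fixed hop count."""
    query = ("match n:node-[r*%d..%d]->m:node "
             "where n.noscenda_name={target} return count(r);" % (hops, hops))
    return {"statement": query, "parameters": {"target": target}}


def build_request_body(hops, target):
    """JSON body with the statement wrapped in a "statements" list
    (assumed request format, see the note above)."""
    return json.dumps({"statements": [build_statement(hops, target)]})


print(build_request_body(4, "person3089"))
```

Passing the person name as a parameter instead of pasting it into the query text lets the server reuse the parsed query across calls.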

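Open questions/thoughts

I'd really like to see the claims in the books be reproducible and correct, with neo4j faster than mysql rather than slower.

But I don't know what I am doing wrong... :-(

So, my big hopes are:

I didn't do the memory settings correctly for neo4j
The query I use for neo4j is completely wrong

Any suggestions to get neo4j up to speed are highly welcome.

Thanks a lot,

Joerg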

score 1 · Accepted Answer

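2.0 has not been performance-optimized at all, so you should use 1.9.2 for the comparison. (If you do use 2.0: did you create an index for n.noscenda_name?)

You can check the query plan with profile start ....

For 1.9, use a manual index or node_auto_index for noscenda_name.

You can try this query: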

start n=node:node_auto_index(noscenda_name={target})
match n-->()-->()-->m
return count(*);
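
Note that the pattern above spells out three relationships, so it corresponds to the 3-hop case; the 4-hop test needs one more step. A small sketch for generating the pattern for any hop count (the helper name `fixed_hops_query` is hypothetical, the query text follows the one above):

```python
def fixed_hops_query(hops):
    """Build the index-lookup query above with `hops` explicit steps."""
    if hops < 1:
        raise ValueError("hops must be >= 1")
    # hops - 1 anonymous intermediate nodes, then the end node m.
    pattern = "n" + "-->()" * (hops - 1) + "-->m"
    return ("start n=node:node_auto_index(noscenda_name={target})\n"
            "match %s\n"
            "return count(*);" % pattern)


print(fixed_hops_query(4))
```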

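Fulltext indexes are also more expensive than exact indexes, so keep the index for noscenda_name as exact.

I couldn't get your importer to run; it fails at some point. Maybe you can share the finished neo4j database?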

python importer.py
reading rels
reading nodes
delete old
Traceback (most recent call last):
  File "importer.py", line 9, in <module>
    g.query('match n-[r]->m delete r;')
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 99, in query
    return self.call(payload)
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 71, in call
    self.transactionurl = result.headers['location']
  File "/Library/Python/2.7/site-packages/requests-1.2.3-py2.7.egg/requests/structures.py", line 77, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'location'

score 0 · Accepted Answer

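Just to add to what Michael said: in the book, I believe the authors are referring to a comparison that was done in the Neo4j in Action book. It is described in the free first chapter of that book.

At the top of page 7 they explain that they were using the Traversal API rather than Cypher.

I think you'll struggle to get Cypher near that level of performance at the moment, so if you want to do these kinds of queries you'll want to use the Traversal API directly and then perhaps wrap it in an unmanaged extension.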
