我有点难过。
在我的数据库中,我有这样的关系:
(u:User)-[r1:LISTENS_TO]->(a:Artist)<-[r2:LISTENS_TO]-(u2:User)
我想对给定用户执行查询,在哪里找到该用户和其他所有用户之间的共同艺术家。
为了大致了解我的数据库的大小,我有大约 600 个用户、47,546 位艺术家以及用户和艺术家之间的 184,211 个关系。
我尝试的第一个查询如下:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
WHERE
other:User
WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN other.username, mutualArtists
这需要大约 20 秒才能返回。此查询的配置文件如下:
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns other.username, mutualArtists |
| Extract | 10 | 20 | | other.username |
| ColumnFilter(1) | 10 | 0 | | keep columns other, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEb6facb18-1c5d-45a6-83bf-a75c25ba6baf of type Integer) |
| EagerAggregation | 563 | 0 | | other |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, me, ar2, ar1, other | |
| Filter(1) | 1 | 3 | | ((hasLabel(me:User(3)) AND hasLabel(other:User(3))) AND hasLabel(other:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
我很沮丧。这似乎不需要 20 秒。
后来我又回到了这个问题上,并尝试从一开始就对其进行调试。
我开始分解查询,我注意到我得到了更快的结果。如果没有 Neo4J Spatial 查询,我在大约 1.5 秒内得到结果。
我终于添加了一些东西,最后得到了以下查询:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
WHERE
u2:User
WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN u2.username, mutualArtists
此查询在 4240 毫秒内返回。5倍的提升!此查询的配置文件如下:
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns u2.username, mutualArtists |
| Extract | 10 | 20 | | u2.username |
| ColumnFilter(1) | 10 | 0 | | keep columns u2, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEbdf86ac1-8677-4d45-967f-c2dd594aba49 of type Integer) |
| EagerAggregation | 563 | 0 | | u2 |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, u2, u, ar2, ar1 | |
| Filter(1) | 1 | 3 | | ((hasLabel(u:User(3)) AND hasLabel(u2:User(3))) AND hasLabel(u2:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
而且,为了证明我连续运行它们并得到了非常不同的结果:
neo4j-sh (?)$ START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
> WHERE
> u2:User
>
> WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN u2.username, mutualArtists
> ;
+------------------------------+
| u2.username | mutualArtists |
+------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+------------------------------+
10 rows
4240 ms
neo4j-sh (?)$ START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
> WHERE
> other:User
>
> WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN other.username, mutualArtists;
+--------------------------------+
| other.username | mutualArtists |
+--------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+--------------------------------+
10 rows
20418 ms
除非我发疯了,否则这两个查询之间的唯一区别是节点的名称(我已将“me”更改为“u”,将“other”更改为“u2”)。
为什么这会导致 5 倍的改进??!?!
如果有人对此有任何见解,我将永远感激不尽。
谢谢,
-亚当
编辑 8.1.14
根据@ulkas 的建议,我尝试简化查询。
结果是:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
RETURN u2.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~4 秒
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~20 秒
太奇怪了。似乎从字面上看,“其他”和“我”的命名节点导致查询时间急剧增加。我很困惑。
谢谢,-亚当