我正在做一个项目,我必须根据过滤器进行实时推荐。我决定看一下graph db并开始使用neo4j并将它的性能与mysql进行比较。
行是关于:
"broadcast": 159844,
"format": 5,
"genre": 10,
"program": 60495
mysql的查询看起来像:
select f.id, sum(weight) as total
from
(
select program.id, 15 as weight
from broadcast
inner join program on broadcast.programId = program.id
where broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 10 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
where genre.id in (13) and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 5 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
inner join format on genre.formatId = format.id
where format.id = 6 and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
) f
group by f.id
order by total desc, id desc
limit 0, 50
在我的本地机器上,查询在大约 300 毫秒内执行。这可能是可以接受的,但对于实时处理来说,100 毫秒以下会更好。
我还在一些帮助下写了一个 thinkerpop3 查询:
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.in('programBroadcast')
.dedup()
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.order(local).by(valueDecr)
.limit(local, 50)
查询执行大约 700 毫秒!
===== 编辑 =====
因为我想显示查询的分析,所以我得到了:
Step Count Traversers Time (ms) % Dur
=============================================================================================================
Neo4jGraphStep([],vertex) 220513 220513 14135.788 68.52
HasStep([~label.eq(broadcast)]) 159844 159844 391.087 1.90
VertexStep(IN,[programBroadcast],vertex) 159825 159825 267.202 1.30
DedupGlobalStep 60495 60495 95.848 0.46
UnionStep([[LambdaFilterStep(lambda)@[p, w], Pr... 63247 63247 2008.553 9.74
LambdaFilterStep(lambda)@[p, w] 60495 60495 194.406
SelectStep([p, w],[value(id), [ConstantStep(1... 60495 60495 487.158
ConstantStep(15) 60495 60495 24.214
EndStep 60495 60495 110.575
TraversalFilterStep([VertexStep(OUT,[programG... 2070 2070 410.689
VertexStep(OUT,[programGenre],vertex) 22540 22540 191.158
HasStep([id.eq(6)]) 0 0 140.934
SelectStep([p, w],[value(id), [ConstantStep(1... 2070 2070 52.203
ConstantStep(10) 2070 2070 0.654
EndStep 2070 2070 43.120
TraversalFilterStep([VertexStep(OUT,[programG... 682 682 443.347
VertexStep(OUT,[programGenre],vertex) 22540 22540 119.115
VertexStep(OUT,[genreFormat],vertex) 27510 27510 117.410
HasStep([id.eq(1)]) 0 0 133.517
SelectStep([p, w],[value(id), [ConstantStep(5... 682 682 43.247
ConstantStep(5) 682 682 0.217
EndStep 682 682 44.427
GroupStep([SelectOneStep(p), ProfileStep],[Sele... 1 1 3583.249 17.37
SelectOneStep(p) 63247 63247 26.836
SelectOneStep(w) 63247 63247 81.623
SumGlobalStep 60495 60495 3107.593
SelectOneStep(w) 0 0 0.000
SumGlobalStep 0 0 0.000
UnfoldStep 60495 60495 17.227 0.08
OrderGlobalStep([valueDecr]) 60495 60495 114.439 0.55
FoldStep 1 1 16.902 0.08
RangeLocalStep(0,10) 1 1 0.081 0.00
SideEffectCapStep([~metrics]) 1 1 0.215 0.00
- show quoted text -
这表明 68% 的时间发生在没有索引的 gV() 上!
突然间,我试图找到一种方法来获得一个单一的起点,所以我做到了:
graph.addVertex(label, 'day', 'id', 1)
graph.cypher("CREATE INDEX ON :day(id)")
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.map{it.get().addEdge('broadcastDay', g.V().hasLabel('day').has('id', 1).next())}
现在查询看起来像:
g.V(14727)
.in('broadcastDay')
.in('programBroadcast')
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.unfold().order().by(valueDecr).fold()
.limit(local, 50)
执行时间现在是 137 毫秒!
===== 结束编辑 =====
在我的情况下,Neo4j 比 mysql 慢...
所以我决定用这种天真的方法在密码中进行查询(感谢这篇文章):
WITH [] AS res
MATCH (b:broadcast)-[:programBroadcast]-(p:program)
WHERE b.startedAt > timestamp() and b.startedAt < (timestamp() + 1000 * 60 * 60 * 24)
OPTIONAL MATCH (p)
WITH res, COLLECT({id: p.id, weight: 15}) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre{id:4})
WITH res, (CASE WHEN g IS NOT NULL THEN COLLECT({id: p.id, weight: 10}) ELSE [] END) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre)-[:genreFormat]-(f:format{id:4})
WITH res, (CASE WHEN f IS NOT NULL THEN COLLECT({id: p.id, weight: 5}) ELSE [] END) AS data
WITH res + data AS res
UNWIND res AS result
WITH result, result.id as id, SUM(result.weight) as weight
ORDER BY weight DESC
LIMIT 10
RETURN id, weight
我大约是 68614 毫秒!
我对graph db非常失望,但我不明白为什么,我在每个属性上都使用了索引并将java内存设置为4g左右,与mysql相比它卡住了,为什么?图数据库仅适用于 mysql 无法执行连接的大数据?