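I'm working on a project where I have to make real-time recommendations based on filters. I decided to look at graph databases, started using Neo4j, and compared its performance with MySQL.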

The row counts are approximately:

 "broadcast": 159844,
 "format": 5,
 "genre": 10,
 "program": 60495

The MySQL query looks like this:

select f.id, sum(weight) as total
from
(
    select program.id, 15 as weight
    from broadcast
    inner join program on broadcast.programId = program.id
    where broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
    group by program.id
union all
    select program.id, 10 as weight
    from broadcast
    inner join program on broadcast.programId = program.id
    inner join genre ON program.genreId = genre.id
    where genre.id in (13) and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
    group by program.id
union all
    select program.id, 5 as weight
    from broadcast
    inner join program on broadcast.programId = program.id
    inner join genre ON program.genreId = genre.id
    inner join format on genre.formatId = format.id
    where format.id = 6 and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
    group by program.id
) f
group by f.id
order by total desc, id desc
limit 0, 50

On my local machine, the query runs in about 300 ms. That might be acceptable, but for real-time use, under 100 ms would be better.
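
To make the intent of the MySQL query explicit, here is a minimal Python sketch of the scoring it computes: 15 points for any program airing in the next 24 hours, plus 10 if its genre matches and 5 if its format matches. The function name `score_programs` and the in-memory dict/list shapes are assumptions made for illustration only; they are not part of my schema or of either database.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def score_programs(broadcasts, programs, genres, now, genre_ids, format_id, limit=50):
    """Weighted scoring that mirrors the three UNION ALL branches.

    broadcasts: iterable of (program_id, started_at) pairs
    programs:   dict program_id -> genre_id
    genres:     dict genre_id -> format_id
    (hypothetical in-memory stand-ins for the tables)
    """
    window_end = now + timedelta(days=1)
    # Programs with at least one broadcast strictly inside (now, now + 1 day).
    # The set plays the role of GROUP BY program.id in each branch.
    airing = {pid for pid, started_at in broadcasts
              if pid in programs and now < started_at < window_end}

    totals = defaultdict(int)
    for pid in airing:
        totals[pid] += 15                      # branch 1: airs in the window
        genre_id = programs[pid]
        if genre_id in genres and genre_id in genre_ids:
            totals[pid] += 10                  # branch 2: genre filter matches
        if genres.get(genre_id) == format_id:
            totals[pid] += 5                   # branch 3: format filter matches

    # ORDER BY total DESC, id DESC LIMIT 0, 50
    ranked = sorted(totals.items(), key=lambda item: (item[1], item[0]), reverse=True)
    return ranked[:limit]


# Tiny made-up data set, only to show the call shape.
now = datetime(2020, 1, 1, 12, 0)
broadcasts = [
    (1, now + timedelta(hours=1)),
    (2, now + timedelta(hours=2)),
    (3, now + timedelta(days=2)),   # outside the 24-hour window
    (4, now + timedelta(hours=5)),
]
programs = {1: 13, 2: 7, 3: 13, 4: 8}
genres = {13: 6, 7: 6, 8: 2}
print(score_programs(broadcasts, programs, genres, now, genre_ids={13}, format_id=6))
# prints [(1, 30), (2, 20), (4, 15)]
```

Each program is scored once per matching filter, so the work is proportional to the number of programs airing in the window rather than to the size of the whole data set.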

With some help, I also wrote a TinkerPop3 query:

g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.in('programBroadcast')
.dedup()
.union(
   filter{true}
     .as('p', 'w').select('p', 'w').by('id').by(constant(15)),
   filter(out('programGenre').has('id', 4))
     .as('p', 'w').select('p', 'w').by('id').by(constant(10)),
   filter(out('programGenre').out('genreFormat').has('id', 4))
     .as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.order(local).by(valueDecr)
.limit(local, 50)

The query takes about 700 ms!

===== EDIT =====

I profiled the query to show where the time goes, and got:

Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
Neo4jGraphStep([],vertex)                                         220513      220513       14135.788    68.52
HasStep([~label.eq(broadcast)])                                   159844      159844         391.087     1.90
VertexStep(IN,[programBroadcast],vertex)                          159825      159825         267.202     1.30
DedupGlobalStep                                                    60495       60495          95.848     0.46
UnionStep([[LambdaFilterStep(lambda)@[p, w], Pr...                 63247       63247        2008.553     9.74
  LambdaFilterStep(lambda)@[p, w]                                  60495       60495         194.406
  SelectStep([p, w],[value(id), [ConstantStep(1...                 60495       60495         487.158
    ConstantStep(15)                                               60495       60495          24.214
  EndStep                                                          60495       60495         110.575
  TraversalFilterStep([VertexStep(OUT,[programG...                  2070        2070         410.689
    VertexStep(OUT,[programGenre],vertex)                          22540       22540         191.158
    HasStep([id.eq(6)])                                                0           0         140.934
  SelectStep([p, w],[value(id), [ConstantStep(1...                  2070        2070          52.203
    ConstantStep(10)                                                2070        2070           0.654
  EndStep                                                           2070        2070          43.120
  TraversalFilterStep([VertexStep(OUT,[programG...                   682         682         443.347
    VertexStep(OUT,[programGenre],vertex)                          22540       22540         119.115
    VertexStep(OUT,[genreFormat],vertex)                           27510       27510         117.410
    HasStep([id.eq(1)])                                                0           0         133.517
  SelectStep([p, w],[value(id), [ConstantStep(5...                   682         682          43.247
    ConstantStep(5)                                                  682         682           0.217
  EndStep                                                            682         682          44.427
GroupStep([SelectOneStep(p), ProfileStep],[Sele...                     1           1        3583.249    17.37
  SelectOneStep(p)                                                 63247       63247          26.836
  SelectOneStep(w)                                                 63247       63247          81.623
  SumGlobalStep                                                    60495       60495        3107.593
  SelectOneStep(w)                                                     0           0           0.000
  SumGlobalStep                                                        0           0           0.000
UnfoldStep                                                         60495       60495          17.227     0.08
OrderGlobalStep([valueDecr])                                       60495       60495         114.439     0.55
FoldStep                                                               1           1          16.902     0.08
RangeLocalStep(0,10)                                                   1           1           0.081     0.00
SideEffectCapStep([~metrics])                                          1           1           0.215     0.00

This shows that 68% of the time is spent in g.V(), which runs without an index!
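
As a rough illustration of why the unindexed start step dominates, here is a plain-Python sketch contrasting a scan over every vertex with a range lookup on a sorted index of `startedAt` values. This is not how Neo4j implements its indexes; `scan_all`, `build_index` and `index_range` are hypothetical names, and the dict-based vertices are an assumption made only for the example.

```python
import bisect


def scan_all(vertices, lo, hi):
    """Visit every vertex and filter by label and open interval (lo, hi).

    Comparable in spirit to starting from g.V() with no usable index:
    the cost grows with the total number of vertices.
    """
    return [v["id"] for v in vertices
            if v["label"] == "broadcast" and lo < v["startedAt"] < hi]


def build_index(vertices):
    """Build a sorted stand-in for a range index on broadcast.startedAt."""
    pairs = sorted((v["startedAt"], v["id"])
                   for v in vertices if v["label"] == "broadcast")
    keys = [started_at for started_at, _ in pairs]
    ids = [vertex_id for _, vertex_id in pairs]
    return keys, ids


def index_range(index, lo, hi):
    """Binary-search both bounds, then read only the matching entries."""
    keys, ids = index
    start = bisect.bisect_right(keys, lo)  # first key strictly greater than lo
    end = bisect.bisect_left(keys, hi)     # first key not less than hi
    return ids[start:end]
```

Both functions return the same vertices for the same window, but the indexed version touches only the matching entries after two binary searches, while the scan has to look at all 220513 vertices every time.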

So I tried to find a way to get a single starting point, and did this:

graph.addVertex(label, 'day', 'id', 1)
graph.cypher("CREATE INDEX ON :day(id)")
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.map{it.get().addEdge('broadcastDay', g.V().hasLabel('day').has('id', 1).next())}

The query now looks like this:

g.V(14727)
.in('broadcastDay')
.in('programBroadcast')
.union(
   filter{true}
     .as('p', 'w').select('p', 'w').by('id').by(constant(15)),
   filter(out('programGenre').has('id', 4))
     .as('p', 'w').select('p', 'w').by('id').by(constant(10)),
   filter(out('programGenre').out('genreFormat').has('id', 4))
     .as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.unfold().order().by(valueDecr).fold()
.limit(local, 50)

Execution time is now 137 ms!

===== END EDIT =====

In my case, Neo4j was slower than MySQL...

So I decided to write the query in Cypher using this naive approach (thanks to this post):

WITH [] AS res

MATCH (b:broadcast)-[:programBroadcast]-(p:program)
WHERE b.startedAt > timestamp() and b.startedAt < (timestamp() + 1000 * 60 * 60 * 24)

OPTIONAL MATCH (p)
WITH res, COLLECT({id: p.id, weight: 15}) AS data
WITH res + data AS res

OPTIONAL MATCH (p)-[:programGenre]-(g:genre{id:4})
WITH res, (CASE WHEN g IS NOT NULL THEN COLLECT({id: p.id, weight: 10}) ELSE [] END) AS data
WITH res + data AS res

OPTIONAL MATCH (p)-[:programGenre]-(g:genre)-[:genreFormat]-(f:format{id:4})
WITH res, (CASE WHEN f IS NOT NULL THEN COLLECT({id: p.id, weight: 5}) ELSE [] END) AS data
WITH res + data AS res

UNWIND res AS result
WITH result, result.id as id, SUM(result.weight) as weight
ORDER BY weight DESC
LIMIT 10
RETURN id, weight

That took about 68614 ms!

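I'm very disappointed with graph databases, and I don't understand why. I created an index on every property and set the Java memory to about 4 GB, yet it lags far behind MySQL. Why? Are graph databases only suited to big data, where MySQL can no longer perform the joins?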
