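I am trying to evaluate Neo4j (using the Community Edition).
I am importing some data (1 million rows) with LOAD CSV. The import needs to match previously imported nodes in order to create relationships between them.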

Here is my query:

//Query #3
//create edges between Tr and Ad nodes

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
 FIELDTERMINATOR '\t'

//find appropriate tx and ad
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})

//create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)

//set properties on the edge
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)

I have the following indexes:

Indexes
  ON :Ad(p58)       ONLINE (for uniqueness constraint) 
  ON :Tr(txid)      ONLINE                             
  ON :Tr(h)         ONLINE (for uniqueness constraint)

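This query has been running for 5 days and has so far created 270K relationships (out of 1M).
The Java heap is 4g.
The machine has 32G of RAM and an SSD for its drive, and runs only Linux and Neo4j.

Any tips for speeding up this process would be greatly appreciated.
Should I try the Enterprise Edition?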

Query plan:

+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
If a part of a query contains multiple disconnected patterns, 
this will build a cartesian product between all those parts.
This may produce a large amount of data and slow down query processing.
While occasionally intended, 
it may often be possible to reformulate the query that avoids the use of this cross product,
 perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (ad))
20 ms

Compiler CYPHER 3.0

Planner COST

Runtime INTERPRETED

+---------------------------------+----------------+---------------------+----------------------------+
| Operator                        | Estimated Rows | Variables           | Other                      |
+---------------------------------+----------------+---------------------+----------------------------+
| +ProduceResults                 |              1 |                     |                            |
| |                               +----------------+---------------------+----------------------------+
| +EmptyResult                    |                |                     |                            |
| |                               +----------------+---------------------+----------------------------+
| +Apply                          |              1 | line -- ad, out, tx |                            |
| |\                              +----------------+---------------------+----------------------------+
| | +SetRelationshipProperty(4)   |              1 | ad, out, tx         |                            |
| | |                             +----------------+---------------------+----------------------------+
| | +CreateRelationship           |              1 | out -- ad, tx       |                            |
| | |                             +----------------+---------------------+----------------------------+
| | +ValueHashJoin                |              1 | ad -- tx            | ad.p58; line.p58           |
| | |\                            +----------------+---------------------+----------------------------+
| | | +NodeIndexSeek              |              1 | tx                  | :Tr(txid)                  |
| | |                             +----------------+---------------------+----------------------------+
| | +NodeUniqueIndexSeek(Locking) |              1 | ad                  | :Ad(p58)                   |
| |                               +----------------+---------------------+----------------------------+
| +LoadCSV                        |              1 | line                |                            |
+---------------------------------+----------------+---------------------+----------------------------+

1 Answer

OK, so splitting the MATCH statement into two sped the query up enormously. Thanks to @William Lyon for pointing me to the plan. I noticed the warning.

Old MATCH statement:

MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})

Split into two:

MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
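
For reference, here is a sketch of how the whole import might look with that change applied. It is simply Query #3 from the question with the single comma-separated MATCH replaced by the two MATCH clauses; the file name, labels, and properties are the ones from the question, and nothing else is changed.

```cypher
// Sketch: Query #3 from the question, with the MATCH split in two.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt' AS line
FIELDTERMINATOR '\t'

// two separate MATCH clauses instead of one comma-separated pattern
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad { p58: line.p58 })

// create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)

// set properties on the edge
SET out.id = TOINT(line.id)
SET out.n  = TOINT(line.n)
SET out.v  = TOINT(line.v)
```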

For 750K relationships, the query took 83 seconds.
Next up: a 9 million row CSV load.
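
Before starting a larger load, it may be worth checking the plan first, since that is what exposed the problem here. A minimal sketch, assuming the same file and schema as in the question: prefixing the statement with EXPLAIN asks Neo4j for the plan without executing the query, so USING PERIODIC COMMIT is left out.

```cypher
// Sketch: inspect the plan only; EXPLAIN does not run the query or write data.
EXPLAIN
LOAD CSV WITH HEADERS FROM 'file:///1M.txt' AS line
FIELDTERMINATOR '\t'
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad { p58: line.p58 })
CREATE (tx)-[out:OUT_TO]->(ad)
SET out.id = TOINT(line.id)
```

If the output still carries the cartesian product warning, or shows a join between the two lookups, the query is probably worth reworking before running it on the full file.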

Answered 2016-07-14T17:37:46.583