scala - Multithreaded node creation in Neo4j

Question

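I create 1 million Neo4j nodes in batches of 10,000, each batch in its own transaction. Oddly, parallelizing this process with multithreaded execution has no positive effect on performance. It is as if the transactions in different threads block each other.

Here is a piece of Scala code that tests this with the help of parallel collections: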

import org.neo4j.kernel.EmbeddedGraphDatabase

object Main extends App {

    val total = 1000000
    val batchSize = 10000

    val db = new EmbeddedGraphDatabase("neo4yay")

    Runtime.getRuntime().addShutdownHook(
        new Thread(){override def run() = db.shutdown()}
    )

    (1 to total).grouped(batchSize).toSeq.par.foreach(batch => {

        println("thread %s, nodes from %d to %d"
            .format(Thread.currentThread().getId, batch.head, batch.last))

        val transaction = db.beginTx()
        try{
            batch.foreach(db.createNode().setProperty("Number", _))
            // Mark the transaction as successful; otherwise finish() rolls it back
            transaction.success()
        }finally{
            transaction.finish()
        }
    })
}

Here are the build.sbt lines needed to build and run it:

scalaVersion := "2.9.2"

libraryDependencies += "org.neo4j" % "neo4j-kernel" % "1.8.M07"

fork in run := true

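You can switch between parallel and sequential modes by adding or removing the .par call before the outer foreach. The console output clearly shows that with .par the execution is indeed multithreaded.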

To rule out possible problems with concurrency in this code, I have also tried an actor-based implementation, with about the same result (6 and 7 seconds for sequential and parallel versions, respectively).

So, the question is: did I do something wrong, or is this a Neo4j limitation? Thanks!

score 4 · Accepted Answer

The main issue is that your transactions arrive at about the same time, and transaction commits are serialized writes to the transaction log. If the writes were interleaved in time and the actual node creation were the more expensive part of the work, you would get a speedup.
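
To put rough numbers on this reasoning, here is a minimal sketch that uses Amdahl's law (my own choice of model, not something this answer or Neo4j specifies) to estimate the best possible speedup when part of each batch is a serialized commit. The serial fractions below are hypothetical, not measured, and the estimate ignores lock contention, which may explain why the parallel run in the question was slightly slower rather than merely not faster.

```scala
object SerializedCommitEstimate {

  // Amdahl's law: upper bound on the speedup when `serialFraction` of the
  // total work (here, the serialized transaction-log writes) cannot overlap.
  def speedup(serialFraction: Double, threads: Int): Double = {
    require(serialFraction >= 0.0 && serialFraction <= 1.0,
      "serialFraction must be between 0 and 1")
    require(threads >= 1, "threads must be at least 1")
    1.0 / (serialFraction + (1.0 - serialFraction) / threads)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical serial fractions, chosen only for illustration.
    for (serial <- Seq(0.9, 0.5, 0.1)) {
      println("serial fraction %.1f, 4 threads: at most %.2fx"
        .format(serial, speedup(serial, 4)))
    }
  }
}
```

If the commit dominates (a serial fraction near 1), the bound stays close to 1x no matter how many threads run, which matches the observation in the question. Only when node creation becomes the more expensive part does the bound approach the thread count.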

score 2 · Accepted Answer

Batch insert does not work with multiple threads. From the Neo4j documentation:

Always perform batch insertion in a single thread (or use synchronization to make only one thread at a time access the batch inserter) and invoke shutdown when finished.

Neo4j Batch insert
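
For comparison, here is a minimal single-threaded sketch of what the batch inserter approach might look like for the same 1 million nodes. It assumes the BatchInserters factory in org.neo4j.unsafe.batchinsert, as described in the Neo4j 1.8 manual; the store directory name "neo4yay-batch" is hypothetical. The code in the question uses EmbeddedGraphDatabase rather than the batch inserter, so this is an alternative, not a drop-in fix.

```scala
import java.util.{HashMap => JHashMap}
import org.neo4j.unsafe.batchinsert.BatchInserters

object BatchInsertSketch extends App {

  val total = 1000000

  // Hypothetical store directory; it must not be open in a running
  // database instance while the batch inserter is using it.
  val inserter = BatchInserters.inserter("neo4yay-batch")

  try {
    // Single thread only, as the quoted documentation requires.
    (1 to total).foreach { n =>
      val props = new JHashMap[String, AnyRef]()
      props.put("Number", java.lang.Integer.valueOf(n))
      inserter.createNode(props)
    }
  } finally {
    // The documentation says to invoke shutdown when finished.
    inserter.shutdown()
  }
}
```

The batch inserter is not transactional, so it should only be used for an initial import into a store that nothing else is accessing.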
