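I'm creating 1 million Neo4j nodes in batches of 10000, with each batch in its own transaction. Strangely, parallelizing this process with multithreaded execution has no positive effect on performance. It is as if the transactions in different threads were blocking each other.

Here is a piece of Scala code that tests this with the help of parallel collections: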

import org.neo4j.kernel.EmbeddedGraphDatabase

object Main extends App {

    val total = 1000000
    val batchSize = 10000

    val db = new EmbeddedGraphDatabase("neo4yay")

    Runtime.getRuntime().addShutdownHook(
        new Thread(){override def run() = db.shutdown()}
    )

    (1 to total).grouped(batchSize).toSeq.par.foreach(batch => {

        println("thread %s, nodes from %d to %d"
            .format(Thread.currentThread().getId, batch.head, batch.last))

        val transaction = db.beginTx()
        try{
            batch.foreach(db.createNode().setProperty("Number", _))
            // Mark the transaction as successful; otherwise finish() rolls it back
            transaction.success()
        }finally{
            transaction.finish()
        }
    })
}

Here are the build.sbt lines needed to build and run it:

scalaVersion := "2.9.2"

libraryDependencies += "org.neo4j" % "neo4j-kernel" % "1.8.M07"

fork in run := true

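You can switch between parallel and sequential modes by removing or adding the .par call before the outer foreach. The console output clearly shows that with .par the execution is indeed multithreaded.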

To rule out possible problems with concurrency in this code, I have also tried an actor-based implementation, with about the same result (6 and 7 seconds for sequential and parallel versions, respectively).

So, the question is: did I do something wrong, or is this a Neo4j limitation? Thanks!

2 Answers

The main issue is that your transactions all arrive at about the same time, and transaction commits are serialized writes to the transaction log. If the writes were interleaved over time and the actual node creation were a more expensive process, you would see a speedup.
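To make the reasoning above concrete, here is a minimal sketch that models it without Neo4j at all. Everything in it is an assumption for illustration: the single lock stands in for the serialized transaction log, `Thread.sleep` stands in for per-batch work and commit cost, and the names `SerializedCommitSketch`, `elapsedMillis`, and `maxConcurrentCommits` are hypothetical. It does not reflect how Neo4j actually implements commits.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.locks.ReentrantLock

object SerializedCommitSketch {
  // Hypothetical stand-in for the transaction log: one lock, so commits run one at a time.
  private val logLock = new ReentrantLock()
  private val inCommit = new AtomicInteger(0)
  // Highest number of commits ever observed running at the same moment.
  val maxConcurrentCommits = new AtomicInteger(0)

  private def commit(commitMillis: Long): Unit = {
    logLock.lock()
    try {
      val now = inCommit.incrementAndGet()
      if (now > maxConcurrentCommits.get()) maxConcurrentCommits.set(now)
      Thread.sleep(commitMillis) // simulated cost of writing to the log
      inCommit.decrementAndGet()
    } finally {
      logLock.unlock()
    }
  }

  // Runs `batches` simulated transactions on `threads` threads and returns the wall-clock time.
  def elapsedMillis(batches: Int, threads: Int, workMillis: Long, commitMillis: Long): Long = {
    val pool = Executors.newFixedThreadPool(threads)
    val start = System.nanoTime()
    (1 to batches).foreach { _ =>
      pool.execute(new Runnable {
        override def run(): Unit = {
          Thread.sleep(workMillis) // simulated node creation, free to run in parallel
          commit(commitMillis)     // simulated commit, serialized by the lock
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    (System.nanoTime() - start) / 1000000L
  }
}
```

In this model, when `workMillis` is small compared with `commitMillis`, adding threads barely changes the total time, because every batch still queues for the same lock. When `workMillis` dominates, the parallel part shrinks as threads are added, which matches the speedup described above.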

answered 2012-09-27T22:58:36.320

Batch insert does not work with multiple threads. From the Neo4j documentation:

Always perform batch insertion in a single thread (or use synchronization to make only one thread at a time access the batch inserter) and invoke shutdown when finished.

Neo4j Batch insert
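One way to follow that advice while still preparing data on several threads is to funnel every write through a single writer thread. The sketch below only illustrates the pattern: `FakeInserter` is a hypothetical stand-in for a non-thread-safe inserter, not the Neo4j batch inserter API, and `SingleWriterSketch` and `insertAll` are names invented for this example.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object SingleWriterSketch {
  // Hypothetical, deliberately non-thread-safe stand-in for a batch inserter.
  final class FakeInserter {
    private var count = 0L
    private var owner: Thread = null
    // Becomes true if more than one thread ever calls createNode.
    var sawSecondThread = false

    def createNode(number: Int): Long = {
      val t = Thread.currentThread()
      if (owner == null) owner = t
      else if (owner ne t) sawSecondThread = true
      count += 1
      count
    }

    def created: Long = count
  }

  def insertAll(total: Int, batchSize: Int, producers: Int): FakeInserter = {
    val inserter = new FakeInserter
    // The only thread that ever touches the inserter.
    val writer = Executors.newSingleThreadExecutor()
    // Threads that prepare batches in parallel.
    val workers = Executors.newFixedThreadPool(producers)

    (1 to total).grouped(batchSize).foreach { batch =>
      workers.execute(new Runnable {
        override def run(): Unit = {
          val prepared = batch.toVector // stands in for any per-batch preparation
          writer.execute(new Runnable {
            override def run(): Unit = prepared.foreach(n => inserter.createNode(n))
          })
        }
      })
    }

    // Wait for the producers first so every write has been handed to the writer.
    workers.shutdown()
    workers.awaitTermination(1, TimeUnit.MINUTES)
    writer.shutdown()
    writer.awaitTermination(1, TimeUnit.MINUTES)
    inserter
  }
}
```

For example, `SingleWriterSketch.insertAll(1000000, 10000, 4)` would prepare batches on four threads while all insertion happens on one. This only helps if the preparation step is expensive; the insertion itself stays sequential.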

answered 2012-09-27T09:41:39.340