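I'm creating 1 million Neo4j nodes in batches of 10,000, each batch in its own transaction. The strange thing is that parallelizing this process with multithreaded execution hasn't had any positive effect on performance. It is as if the transactions in different threads were blocking each other.
Here's a piece of Scala code that tests this with the help of parallel collections: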
    import org.neo4j.kernel.EmbeddedGraphDatabase

    object Main extends App {
      val total = 1000000
      val batchSize = 10000
      val db = new EmbeddedGraphDatabase("neo4yay")

      // Shut the database down cleanly when the JVM exits.
      Runtime.getRuntime().addShutdownHook(
        new Thread() { override def run() = db.shutdown() }
      )

      (1 to total).grouped(batchSize).toSeq.par.foreach(batch => {
        println("thread %s, nodes from %d to %d"
          .format(Thread.currentThread().getId, batch.head, batch.last))
        // One transaction per batch.
        val transaction = db.beginTx()
        try {
          batch.foreach(db.createNode().setProperty("Number", _))
          // Mark the batch for commit; without this call, finish() rolls it back.
          transaction.success()
        } finally {
          transaction.finish()
        }
      })
    }
And here are the build.sbt lines necessary to build and run it:
    scalaVersion := "2.9.2"

    libraryDependencies += "org.neo4j" % "neo4j-kernel" % "1.8.M07"

    fork in run := true
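One can switch between parallel and sequential modes by removing or adding the .par call before the outer foreach. The console output clearly shows that with .par the execution is truly multithreaded.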
To rule out possible problems with concurrency in this code, I have also tried an actor-based implementation, with about the same result (6 and 7 seconds for sequential and parallel versions, respectively).
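For reference, here is a minimal sketch of how the two modes can be timed side by side. It is not the exact harness behind the numbers above: `TimingSketch`, `timeMillis`, `batches` and `processBatch` are hypothetical names, `processBatch` is a CPU-only placeholder standing in for the per-batch transaction, and the sketch assumes a Scala version where `.par` is part of the standard library (as with the 2.9.2 build used here).

```scala
object TimingSketch {
  // Hypothetical helper: runs `body` once and returns elapsed wall-clock milliseconds.
  def timeMillis(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  // Same batching as in the question: consecutive groups of `batchSize` numbers.
  def batches(total: Int, batchSize: Int): Seq[Seq[Int]] =
    (1 to total).grouped(batchSize).toList

  // Placeholder workload (an assumption): sums the batch instead of writing to Neo4j.
  def processBatch(batch: Seq[Int]): Unit = {
    val sum = batch.foldLeft(0L)(_ + _)
    require(sum > 0, "batches are non-empty and contain positive numbers")
  }

  def main(args: Array[String]): Unit = {
    val work = batches(1000000, 10000)
    // Sequential run: plain foreach over the batches.
    val sequential = timeMillis(work.foreach(processBatch))
    // Parallel run: the same batches, processed through a parallel collection.
    val parallel = timeMillis(work.par.foreach(processBatch))
    println("sequential: %d ms, parallel: %d ms".format(sequential, parallel))
  }
}
```

A single run like this includes JIT warm-up, so repeating both measurements a few times and comparing the later runs should give a fairer picture.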
So, the question is: did I do something wrong, or is this a Neo4j limitation? Thanks!