We are running Neo4j v1.9.1 (the latest stable release) in embedded mode. On a couple of occasions the process has shut down unexpectedly without neo4j.shutdown() being called. Note: when this occurred, we know there were no outstanding updates or changes being made to the database. This is on Linux.
When the application is started again and opens its connection to Neo4j, it begins the recovery process but then hangs forever. The messages.log file shows:
2013-07-17 21:05:09.143+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: XaResourceManager[nioneo_logical.log] recovery completed.
2013-07-17 21:05:09.143+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Recovery on log [/opt/pricing/data/database/app/nioneo_logical.log.1] completed.
2013-07-17 21:05:09.156+0000 INFO [o.n.k.i.t.TxManager]: TM opening log: /opt/pricing/data/database/app/tm_tx_log.2
2013-07-17 21:05:09.245+0000 INFO [o.n.b.BackupServer]: BackupServer communication server started and bound to /0.0.0.0:6362
2013-07-17 21:05:09.271+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Non clean shutdown detected on log [/opt/pricing/data/database/app/index/lucene.log.2]. Recovery started ...
2013-07-17 21:05:09.271+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: [/opt/pricing/data/database/app/index/lucene.log.2] logVersion=3 with committed tx=317
Most interestingly, we copied the DB over to a desktop machine and wrote a small program that just starts the DB and then shuts it down. Run against the copied DB, it recovered without problems in only a couple of seconds. (This may be because the hung process had already partially recovered the DB, but we don't think so, because the application does recover the DB if we kill it and try running it again.) We repeated this on the Linux machine with the same successful result.
We are obviously working to ensure that shutdown is always called when the application terminates unexpectedly, but the real question is why the recovery process hangs on startup. We did find the following thread, https://groups.google.com/forum/#!msg/neo4j/CBvuMybTRFw/NMIOpBjrIYIJ, but it is about running the DB as a server and simply increasing the timeout, although the point reached in messages.log is exactly the same as ours.
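For reference, a minimal sketch of the "always call shutdown" part, assuming a plain JVM shutdown hook is acceptable. ShutdownGuard is a hypothetical helper, not part of Neo4j, and the Runnable it wraps is a stand-in for whatever calls shutdown() on the embedded database. The guard makes sure the action runs only once, whether it is triggered explicitly or by the hook.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical helper: runs a shutdown action at most once, either when
 * called explicitly or from a JVM shutdown hook. The action is a stand-in
 * for the call that shuts down the embedded database.
 */
final class ShutdownGuard {
    private final Runnable action;
    private final AtomicBoolean done = new AtomicBoolean(false);

    ShutdownGuard(Runnable action) {
        this.action = action;
    }

    /** Registers a JVM shutdown hook that calls shutdown(); returns the hook thread. */
    Thread install() {
        Thread hook = new Thread(new Runnable() {
            @Override
            public void run() {
                shutdown();
            }
        }, "db-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    /** Runs the action once; later calls are no-ops. Returns true if this call ran it. */
    boolean shutdown() {
        if (done.compareAndSet(false, true)) {
            action.run();
            return true;
        }
        return false;
    }
}
```

The intended use would be to create the guard right after the database is opened, pass it a Runnable that calls shutdown() on the database, and call install(). Shutdown hooks run on a normal exit and on SIGTERM or SIGINT, but not on kill -9, a JVM crash, or a power loss, so this reduces the number of unclean shutdowns without removing the need for recovery to work.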
As a temporary workaround, if the recovery hangs we can run the small 'dummy' program to see whether that fixes the DB, but we would rather get to the root cause.
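To make that workaround less manual, one option is to detect the hang with a timeout around startup. The following is only a sketch under assumptions: StartupWatchdog and startWithTimeout are hypothetical names, and the Callable stands in for whatever code opens the embedded database. It uses only java.util.concurrent and does not call any Neo4j API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a startup task with an upper bound on how long we wait. */
final class StartupWatchdog {

    /**
     * Runs startup on a daemon thread and waits up to timeoutMillis for it.
     * Returns the task's result, or throws TimeoutException if it did not
     * finish in time. The startup task is a stand-in for opening the database.
     */
    static <T> T startWithTimeout(Callable<T> startup, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "db-startup");
                t.setDaemon(true); // a stuck startup thread must not keep the JVM alive
                return t;
            }
        });
        Future<T> future = executor.submit(startup);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // No effect if the task already finished. Otherwise this interrupts
            // the thread, which a blocked startup may simply ignore.
            future.cancel(true);
            executor.shutdownNow();
        }
    }
}
```

On a TimeoutException the caller could log the failure and fall back to the dummy program. Before killing the hung process, taking a thread dump with jstack would show where the startup thread is blocked, which is probably the most direct route to the root cause.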
Does anybody have any advice?