I have a set of log data in the form of flat files from which I want to form a graph (based on information in the log) and load it into the Titan database. This data is a few gigabytes in size. I am exploring the bulk loading options Faunus and BatchGraph (which I read about at https://github.com/thinkaurelius/titan/wiki/Bulk-Loading). The tab-separated log data I have needs a bit of processing on each line of the file to form the graph nodes and edges I have in mind. Will Faunus/BatchGraph serve this use case? If yes, what format should my input file be in for these tools to work? If not, is using the Blueprints API the way to go? Any resources you can share on your suggestion are very much appreciated since I'm a novice. Thanks!
1 Answer
To answer your question in simple fashion, I think you will want to use Faunus to load your data. I would recommend cleaning and transforming your data with external tools first if possible. Tab-delimited is a fine format, but how you prepare these files can have an impact on loading performance (e.g. sometimes simply sorting the data the right way can provide a big speed boost).
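As a rough illustration of that kind of external clean-up step, here is a minimal Python sketch that turns raw tab-separated log lines into a de-duplicated, sorted edge list. The column layout (timestamp, source, action, target) and the function names are assumptions made up for the example, not something taken from your logs or required by Faunus; adjust them to whatever your lines actually contain.

```python
import io


def log_to_edges(lines):
    """Parse tab-separated log lines into sorted, de-duplicated edges.

    Assumed (hypothetical) column layout: timestamp, source, action, target.
    """
    edges = set()
    for line in lines:
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) < 4:
            continue  # skip malformed lines instead of failing the whole run
        _timestamp, source, action, target = fields[:4]
        if source and target:
            edges.add((source, action, target))
    # Sorting groups all edges of one source vertex together.
    return sorted(edges)


def write_edges(edges, out):
    """Write edges back out as tab-delimited text, one edge per line."""
    for source, action, target in edges:
        out.write("\t".join((source, action, target)) + "\n")


sample_log = [
    "2014-06-01T10:00:00\tbob\tviewed\tpage2",
    "2014-06-01T10:00:05\talice\tviewed\tpage1",
    "2014-06-01T10:00:09\tbob\tviewed\tpage2",  # duplicate edge
    "bad line",                                  # malformed, ignored
]

edges = log_to_edges(sample_log)
buffer = io.StringIO()
write_edges(edges, buffer)
print(buffer.getvalue(), end="")
```

Holding every edge in a set is fine for a small slice; for the full multi-gigabyte data you would more likely stream the lines and sort on disk with an external tool.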
The more complete answer lies in these two resources. They should help you decide on an approach:
http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/
http://thinkaurelius.com/2014/06/02/powers-of-ten-part-ii/
I would offer this additional advice - if you are truly a novice, I recommend that you find some slice of your data that produces somewhere between 100K and 1M edges. Focus on simply loading that with BatchGraph or just the Blueprints API as described in Part I of those blog posts. Get used to Gremlin a bit by querying the data in this small case. Use this time to develop methods for validating what you've loaded. Once you feel comfortable with all of that, then work on scaling it up to the full size.
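One possible way to approach the validation part is to compute the numbers you expect directly from the input slice and compare them with the counts the database reports (for instance from Gremlin count queries, which are not shown here). The sketch below is only an illustration: the function names are hypothetical, and it assumes each input row becomes exactly one edge and each distinct identifier becomes exactly one vertex.

```python
from collections import Counter


def expected_stats(edges):
    """Compute the counts a correct load of these edges should produce."""
    vertices = set()
    out_degree = Counter()
    for source, _label, target in edges:
        vertices.add(source)
        vertices.add(target)
        out_degree[source] += 1
    return {
        "vertices": len(vertices),
        "edges": len(edges),
        "out_degree": dict(out_degree),
    }


def check_load(expected, loaded_vertices, loaded_edges):
    """Return a list of human-readable mismatches (empty if all counts agree)."""
    problems = []
    if loaded_vertices != expected["vertices"]:
        problems.append(
            f"vertex count: expected {expected['vertices']}, loaded {loaded_vertices}"
        )
    if loaded_edges != expected["edges"]:
        problems.append(
            f"edge count: expected {expected['edges']}, loaded {loaded_edges}"
        )
    return problems


slice_edges = [
    ("alice", "viewed", "page1"),
    ("bob", "viewed", "page1"),
    ("bob", "viewed", "page2"),
]

stats = expected_stats(slice_edges)
# The two numbers passed here stand in for counts read back from the database.
print(check_load(stats, loaded_vertices=4, loaded_edges=3))
print(check_load(stats, loaded_vertices=4, loaded_edges=2))
```

The per-vertex out-degrees in `stats["out_degree"]` can be spot-checked the same way against a handful of vertices queried from the graph.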