
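I am trying to benchmark three different graph databases: Titan, OrientDB, and Neo4j. I want to measure the execution time for database creation. As a test case I use this dataset: http://snap.stanford.edu/data/web-flickr.html. Although the data is stored locally on disk and not in memory, I noticed that the load consumes a lot of memory, and unfortunately after a while Eclipse crashes. Why is this happening?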

Here are some code snippets. Titan graph creation:

public long createGraphDB(String datasetRoot, TitanGraph titanGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = titanGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = titanGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = titanGraph.addEdge(null, srcVertex, dstVertex, "similar");
                titanGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        titanGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

OrientDB graph creation:

public long createGraphDB(String datasetRoot, OrientGraph orientGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;    
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = orientGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = orientGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
                orientGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        orientGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

Neo4j graph creation:

public long createDB(String datasetRoot, GraphDatabaseService neo4jGraph) {
    long duration;
    long startTime = System.nanoTime(); 
    Transaction tx = neo4jGraph.beginTx();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Node srcNode = neo4jGraph.createNode();
                srcNode.setProperty("nodeId", parts[0]);
                Node dstNode = neo4jGraph.createNode();
                dstNode.setProperty("nodeId", parts[1]);
                Relationship relationship = srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
            }
            lineCounter++;
        }
        tx.success();
        reader.close();
    } 
    catch (IOException e) {
        e.printStackTrace();
    }
    finally {
        tx.finish();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

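Edit: I tried the BatchGraph solution and it seems it will run forever. It ran all night yesterday and never finished, so I had to stop it. Is there anything wrong with my code?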

TitanGraph graph = TitanFactory.open("data/titan");
    BatchGraph<TitanGraph> batchGraph = new BatchGraph<TitanGraph>(graph, VertexIDType.STRING, 1000);
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("data/flickrEdges.txt")));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = batchGraph.getVertex(parts[0]);
                if(srcVertex == null) {
                    srcVertex = batchGraph.addVertex(parts[0]);
                }
                Vertex dstVertex = batchGraph.getVertex(parts[1]);
                if(dstVertex == null) {
                    dstVertex = batchGraph.addVertex(parts[1]);
                }
                Edge edge = batchGraph.addEdge(null, srcVertex, dstVertex, "similar");
                batchGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }

3 Answers

Answered 2013-11-12T08:12:28.260

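With OrientDB you can optimize this import in 2 ways:

  1. using the proprietary extensions, and
  2. avoiding transactions altogether

So open the graph with OrientGraphNoTx instead of OrientGraph, then try this snippet: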

OrientVertex srcVertex = orientGraph.addVertex(null, "nodeId", parts[0] );
OrientVertex dstVertex = orientGraph.addVertex(null, "nodeId", parts[1] );
Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");

There is no need to call .commit().

Answered 2013-11-12T20:59:34.300

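Since you are trying to compare multiple databases, I'd recommend generalizing your code to Blueprints. The Flickr dataset looks like the right size for something like the BatchGraph graph wrapper. With BatchGraph you can tune the commit size and focus on the code that manages the loading. That way you can have one simple class that loads all the different graphs (you can even easily extend your test to other Blueprints-enabled graphs).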
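To make the idea of one loader with a tunable commit size concrete, here is a minimal sketch in plain Java. It does not use Blueprints itself: `GraphSink` and `CountingSink` are hypothetical stand-ins invented for this illustration (a real test would put a Blueprints graph or BatchGraph behind the same role), and the header count and batch size are assumptions you would tune. It creates each vertex only once per node ID and commits once per batch of edges rather than once per edge.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for a transactional graph API; NOT a Blueprints type.
interface GraphSink {
    Object addVertex(String nodeId);
    void addEdge(Object src, Object dst, String label);
    void commit();
}

// In-memory sink that only counts calls, so the sketch runs without a database.
class CountingSink implements GraphSink {
    int vertices, edges, commits;
    public Object addVertex(String nodeId) { vertices++; return nodeId; }
    public void addEdge(Object src, Object dst, String label) { edges++; }
    public void commit() { commits++; }
}

class BatchedEdgeLoader {
    // Reads "src dst" lines, skips headerLines, and commits every batchSize edges.
    static long load(BufferedReader reader, GraphSink sink, int headerLines, int batchSize) {
        // Maps each node ID to its vertex so a node is created only once.
        // Note: this cache itself lives on the heap and grows with the vertex count.
        Map<String, Object> seen = new HashMap<>();
        long edges = 0;
        int lineNo = 0;
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (++lineNo <= headerLines || line.isEmpty()) continue;
            String[] parts = line.split("\\s+");
            if (parts.length < 2) continue; // ignore malformed lines
            Object src = seen.computeIfAbsent(parts[0], sink::addVertex);
            Object dst = seen.computeIfAbsent(parts[1], sink::addVertex);
            sink.addEdge(src, dst, "similar");
            if (++edges % batchSize == 0) sink.commit();
        }
        if (edges % batchSize != 0) sink.commit(); // flush the final partial batch
        return edges;
    }

    public static void main(String[] args) {
        String data = "# h1\n# h2\n# h3\n# h4\n1 2\n1 3\n2 3\n3 4\n4 1\n";
        CountingSink sink = new CountingSink();
        long n = load(new BufferedReader(new StringReader(data)), sink, 4, 2);
        System.out.println(n + " edges, " + sink.vertices + " vertices, " + sink.commits + " commits");
    }
}
```

Compared with the loaders in the question, the two differences are that the commit is no longer issued for every edge, and that a node appearing in many edges maps to a single vertex instead of a new vertex per line.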

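@Stefan makes a good point about memory. You will likely need to raise the -Xmx setting on your JVM to handle that data. Each graph handles memory differently (even though they persist to disk), and if you are loading all three at the same time in the same JVM, I would bet there is some contention in there somewhere.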
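One way to check that a raised -Xmx value actually reaches the JVM that runs the benchmark is to print the heap limits from inside the program. The sketch below uses only the standard `Runtime` API; the class name `HeapReport` is made up for this illustration.

```java
// Hypothetical helper for this illustration; uses only java.lang.Runtime.
class HeapReport {
    static final long MIB = 1024L * 1024L;

    // Upper bound on the heap the JVM will try to use (controlled by -Xmx), in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Approximate heap currently in use, in MiB.
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMiB() + " MiB, used: " + usedHeapMiB() + " MiB");
    }
}
```

Run it with, for example, `java -Xmx4g HeapReport` (the 4g value is only illustrative). When launching from Eclipse, the flag for the launched program goes in the run configuration's VM arguments; the values in eclipse.ini size the IDE's own JVM, not the program it starts. Calling `usedHeapMiB()` before and after each load, ideally with each database in its own JVM run, gives a rough picture of how much heap each one needs.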

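If you plan to go larger than the Flickr dataset you reference, then BatchGraph might not be the right choice. BatchGraph is generally good up to a few hundred million graph elements. Once you start talking about graphs larger than that, you may want to forget some of what I said about staying graph-agnostic: you will probably want to use the best tool for the job for each graph you test. For Neo4j that means Neo4jBatchGraph (at least that way you are still using Blueprints, if that matters to you), for Titan that means Faunus or a custom-written parallel batch loader, and for OrientDB it means OrientBatchGraph.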

Answered 2013-11-12T14:00:45.073