hadoop - Best Giraph vertex input format for input files with String IDs

Question

I have a multi-node Giraph cluster running correctly on my PC. I ran the SimpleShortestPathExample from Giraph and it executed fine.

The algorithm was run with this file (tiny_graph.txt):

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

The file has the following input format:

[source_id,source_value,[[dest_id, edge_value],...]]

Now I am trying to run the same algorithm on the same cluster, but with an input file that differs from the original. My own file looks like this:

[Portada,0,[[Sugerencias para la cita del día,1]]]
[Proverbios españoles,0,[]]
[Neil Armstrong,0,[[Luna,1][ideal,1][verdad,1][Categoria:Ingenieros,2,[Categoria:Estadounidenses,2][Categoria:Astronautas,2]]]
[Categoria:Ingenieros,1,[[Neil Armstrong,2]]]
[Categoria:Estadounidenses,1,[[Neil Armstrong,2]]]
[Categoria:Astronautas,1,[[Neil Armstrong,2]]]

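It is very similar to the original, but the IDs are Strings and the vertex and edge values are Longs. My question is which TextInputFormat I should use, because I have already tried org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat and org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat and could not get either of them to work.

Once this is solved I can adapt the original shortest-path example algorithm to work with my file, but I cannot get to that point until I have a solution.

If this format is not a good choice I could change it, but I don't know what my best option is. My knowledge of Giraph's text input and output formats is really poor, which is why I'm here asking for advice.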

score 3 · Accepted Answer

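It is best to write your own input format. I suggest using the hash code of your strings as the vertex ID. I wrote sample code in which each line contains: [vertex_id (integer, e.g. the hash code of the string), vertex_val (long), [[neighbor_id (integer), neighbor_val (long)], ...]]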

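import java.io.IOException;
import java.util.List;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.json.JSONArray;
import org.json.JSONException;

import com.google.common.collect.Lists;

public class JsonIntLongIntLongVertexInputFormat extends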
  TextVertexInputFormat<IntWritable, LongWritable, LongWritable> {

  @Override
  public TextVertexReader createVertexReader(InputSplit split,
      TaskAttemptContext context) {
    return new JsonIntLongIntLongVertexReader();
  }


  class JsonIntLongIntLongVertexReader extends
    TextVertexReaderFromEachLineProcessedHandlingExceptions<JSONArray,
    JSONException> {

    @Override
    protected JSONArray preprocessLine(Text line) throws JSONException {
      return new JSONArray(line.toString());
    }

    @Override
    protected IntWritable getId(JSONArray jsonVertex) throws JSONException,
              IOException {
      return new IntWritable(jsonVertex.getString(0).hashCode());
    }

    @Override
    protected LongWritable getValue(JSONArray jsonVertex) throws
      JSONException, IOException {
      return new LongWritable(jsonVertex.getLong(1));
    }

    @Override
    protected Iterable<Edge<IntWritable, LongWritable>> getEdges(
        JSONArray jsonVertex) throws JSONException, IOException {
      JSONArray jsonEdgeArray = jsonVertex.getJSONArray(2);
      List<Edge<IntWritable, LongWritable>> edges =
          Lists.newArrayListWithCapacity(jsonEdgeArray.length());
      for (int i = 0; i < jsonEdgeArray.length(); ++i) {
        JSONArray jsonEdge = jsonEdgeArray.getJSONArray(i);
        edges.add(EdgeFactory.create(new IntWritable(jsonEdge.getString(0).hashCode()),
            new LongWritable(jsonEdge.getLong(1))));
      }
      return edges;
    }

    @Override
    protected Vertex<IntWritable, LongWritable, LongWritable>
    handleException(Text line, JSONArray jsonVertex, JSONException e) {
      throw new IllegalArgumentException(
          "Couldn't get vertex from line " + line, e);
    }

  }
}
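
One point worth checking with the hash-code approach is that String.hashCode() is not guaranteed to be unique, so two different vertex names could end up with the same integer ID. The sketch below is a hypothetical helper, not part of Giraph and not from the answer above: the class name HashIdRegistry and its methods are assumptions made for illustration. It could be run over the vertex names before building the input file, to detect collisions and to keep a reverse lookup from ID back to name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Giraph): remembers which string produced each hash-code ID. */
public class HashIdRegistry {
  // Maps each integer ID back to the string that first claimed it.
  private final Map<Integer, String> namesById = new HashMap<>();

  /** Returns name.hashCode(), failing if a different name already produced the same ID. */
  public int idFor(String name) {
    int id = name.hashCode();
    // putIfAbsent keeps the first owner and returns it when the ID is already taken.
    String previous = namesById.putIfAbsent(id, name);
    if (previous != null && !previous.equals(name)) {
      throw new IllegalStateException(
          "Hash collision: '" + previous + "' and '" + name + "' both map to " + id);
    }
    return id;
  }

  /** Looks up the original string for an ID, or null if the ID was never registered. */
  public String nameFor(int id) {
    return namesById.get(id);
  }
}
```

Calling idFor on every vertex and neighbor name in the file would throw on the first collision, for instance on the well-known pair "Aa" and "BB", which share the same hash code. nameFor can then be used to translate the integer IDs in the job output back into the original strings.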

score 1 · Accepted Answer

I solved this by adapting my own file to fit org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat. My original file should look like this:

Portada 0.0     Sugerencias     1.0
Proverbios      0.0
Neil    0.0     Luna    1.0     ideal   1.0     verdad  1.0     Categoria:Ingenieros    2.0     Categoria:Estadounidenses       2.0     Categoria:Astronautas   2.0
Categoria:Ingenieros    1.0     Neil    2.0
Categoria:Estadounidenses       1.0     Neil    2.0
Categoria:Astronautas   1.0     Neil    2.0

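The gaps between the values are tab characters ('\t'), because this format uses the tab as its default delimiter for splitting each line into tokens.

Thanks anyway to @masoud-sagharichian for the help!! :D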
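
As a rough illustration of that layout, the following sketch builds one such tab-separated line (vertex ID, vertex value, then alternating neighbor ID and edge value). The class AdjacencyLineWriter and its method toLine are hypothetical names introduced here, not part of Giraph, and the tab check is an added assumption that IDs must not contain the delimiter.

```java
import java.util.Map;

/** Hypothetical converter (not part of Giraph) that emits one tab-separated adjacency line. */
public class AdjacencyLineWriter {
  private static final String TAB = "\t";

  /** Formats: id TAB value, followed by TAB neighborId TAB edgeValue for each edge. */
  public static String toLine(String vertexId, double vertexValue, Map<String, Double> edges) {
    StringBuilder line = new StringBuilder();
    line.append(requireNoTab(vertexId)).append(TAB).append(vertexValue);
    // Edges are written in the iteration order of the map that was passed in.
    for (Map.Entry<String, Double> edge : edges.entrySet()) {
      line.append(TAB).append(requireNoTab(edge.getKey()))
          .append(TAB).append(edge.getValue());
    }
    return line.toString();
  }

  /** Assumption: an ID containing the delimiter would corrupt the line, so reject it. */
  private static String requireNoTab(String id) {
    if (id.indexOf('\t') >= 0) {
      throw new IllegalArgumentException("Vertex ID must not contain a tab: " + id);
    }
    return id;
  }
}
```

Passing a LinkedHashMap keeps the edges in insertion order; a vertex with an empty edge map produces just the ID and value, like the `Proverbios` line above.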
