apache-spark - Am I forced to use MEMORY_ONLY caching when joining vertices?

Question

Looking at the source of outerJoinVertices,

I'm wondering whether this is a bug or a feature:

override def outerJoinVertices[U: ClassTag, VD2: ClassTag]
      (other: RDD[(VertexId, U)])
      (updateF: (VertexId, VD, Option[U]) => VD2)
      (implicit eq: VD =:= VD2 = null): Graph[VD2, ED] = {
    // The implicit parameter eq will be populated by the compiler if VD and VD2 are equal, and left
    // null if not
    if (eq != null) {
      vertices.cache() // <===== what if I wanted it serialized? 
      // updateF preserves type, so we can use incremental replication
      val newVerts = vertices.leftJoin(other)(updateF).cache()
      val changedVerts = vertices.asInstanceOf[VertexRDD[VD2]].diff(newVerts)
      val newReplicatedVertexView = replicatedVertexView.asInstanceOf[ReplicatedVertexView[VD2, ED]]
        .updateVertices(changedVerts)
      new GraphImpl(newVerts, newReplicatedVertexView)
    } else {
      // updateF does not preserve type, so we must re-replicate all vertices
      val newVerts = vertices.leftJoin(other)(updateF)
      GraphImpl(newVerts, replicatedVertexView.edges)
    }
  }

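Questions

If my graph / the joined vertices are already cached with a different StorageLevel (e.g. MEMORY_ONLY_SER), is that what causes org.apache.spark.graphx.impl.ShippableVertexPartitionOps ... WARN ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow.?
If that is the case, is this a bug in Spark (this is from 1.3.1)? I couldn't find a JIRA issue about it (but I didn't look too hard...)
Why isn't fixing this as simple as passing a new StorageLevel to this method?
What is the workaround? (One I can think of is creating a new Graph from something like vertices.join(otherVertices) and originalGraph.edges ... but that feels wrong...)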

score 1 · Accepted Answer

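Well, I think this is actually not a bug.

Looking at the code of VertexRDD, it overrides the cache method and uses the original StorageLevel that was used to create this VertexRDD: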

  override def cache(): this.type = {
    partitionsRDD.persist(targetStorageLevel)
    this
  }
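
To illustrate the point above, here is a minimal sketch, assuming Spark with GraphX on the classpath and a local master. The object name `StorageLevelSketch`, the helper `build`, and the sample data are all hypothetical. It relies on the `edgeStorageLevel` and `vertexStorageLevel` parameters of `Graph.apply` to choose the target storage level when the graph is built, so that the later `vertices.cache()` call inside outerJoinVertices should persist with that level rather than MEMORY_ONLY.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical driver object; names and sample data are illustrative only.
object StorageLevelSketch {

  // Builds a tiny graph whose vertices and edges target MEMORY_ONLY_SER.
  def build(sc: SparkContext): Graph[String, Int] = {
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

    // The storage levels are chosen here, at construction time.
    Graph(
      vertices,
      edges,
      defaultVertexAttr = "",
      edgeStorageLevel = StorageLevel.MEMORY_ONLY_SER,
      vertexStorageLevel = StorageLevel.MEMORY_ONLY_SER)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("storage-level-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc)
      val extra: RDD[(VertexId, String)] =
        sc.parallelize(Seq((1L, "x"), (3L, "z")))

      // Vertex type is String before and after, so the type-preserving
      // branch (the one that calls vertices.cache()) is taken.
      val joined = graph.outerJoinVertices(extra) {
        (_, attr, opt) => attr + opt.getOrElse("")
      }

      joined.vertices.count() // force evaluation
      println(graph.vertices.getStorageLevel.description)
      println(joined.vertices.getStorageLevel.description)
    } finally {
      sc.stop()
    }
  }
}
```

If the graph comes from `Graph.fromEdges` or `GraphLoader.edgeListFile` instead, those builders accept the same two storage-level parameters.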
