scala - How to work with a joined RDD

Question

Suppose I have two text files named 1.txt and 2.txt. 1.txt contains

1,9,5
2,7,4
3,8,3

and 2.txt contains

1,g,h
2,i,j
3,k,l

So I joined the two by their key (the first column):

val one = sc.textFile("1.txt").map{
  line => val parts = line.split(",",-1)
    (parts(0),(parts(1),parts(2)))
}

val one = sc.textFile("2.txt").map{
  line => val parts = line.split(",",-1)
    (parts(0),(parts(1),parts(2)))
}

Now, if I understand correctly, I get

(1,  (  (9,5), (g,h)  ))
(2,  (  (7,4), (i,j)  ))
(3,  (  (8,3), (k,l)  ))

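Now, suppose I need to sum all the values in the second column of 1.txt.

How would I do that?
How do I refer to the second column of 2.txt (i.e. g, i, k) in the joined RDD?
Is there a good tutorial on working with RDDs? I'm a Spark (and Scala) newbie.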

score 3 · Accepted Answer

The join itself is really easy: val joined = one.join(two) (note that you named both RDDs one for some reason; I assume you meant to give them different names).
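The tuple syntax in Scala is tuple._number, so to sum the second column of 1.txt, if joined is the joined RDD you built, use val sum = joined.map(_._2._1._1.toInt).reduce(_+_). Here _2 is the joined value, the next _1 is the pair that came from 1.txt, and the last _1 is the first value of that pair, which is the second column of the file. If these files are really big, you may want to convert to Long or even BigInt in the map.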
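To make the accessor chain concrete, here is a minimal sketch that uses an ordinary Scala Seq in place of the joined RDD. That substitution is an assumption made so the example runs without a Spark context, and the names TupleAccessDemo, joinedLocal, sumOfOneSecondCol and twoSecondCol are made up for illustration. The same lambdas should work when passed to map on the real joined RDD.

```scala
object TupleAccessDemo {
  // Local stand-in for the joined RDD (assumption: same shape as one.join(two)):
  // (key, ((1.txt col 2, 1.txt col 3), (2.txt col 2, 2.txt col 3)))
  val joinedLocal: Seq[(String, ((String, String), (String, String)))] = Seq(
    ("1", (("9", "5"), ("g", "h"))),
    ("2", (("7", "4"), ("i", "j"))),
    ("3", (("8", "3"), ("k", "l")))
  )

  // _2 = joined value, _1 = pair from 1.txt, _1 = its first field (second column of 1.txt).
  // Converting to Long instead of Int guards against overflow on large inputs.
  val sumOfOneSecondCol: Long = joinedLocal.map(_._2._1._1.toLong).sum

  // _2 = joined value, _2 = pair from 2.txt, _1 = its first field (second column of 2.txt).
  val twoSecondCol: Seq[String] = joinedLocal.map(_._2._2._1)

  def main(args: Array[String]): Unit = {
    println(sumOfOneSecondCol)          // prints 24
    println(twoSecondCol.mkString(",")) // prints g,i,k
  }
}
```

The second accessor, _._2._2._1, is one way to reach g, i and k: swap the middle _1 for _2 to move from the 1.txt pair to the 2.txt pair.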
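I'd say the best Spark tutorials are on the main Spark site and in the AMP Camp material; personally, I also like browsing the source code and the scaladocs. For Scala, "Programming in Scala" is a good start.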

The whole program, rewritten slightly to use better Scala style (disclaimer: I'm not a Scala expert):

val one = sc.textFile("1.txt").map{
  _.split(",", -1) match {
    case Array(a, b, c) => (a, ( b, c))
  }
}

val two = sc.textFile("2.txt").map{
    _.split(",", -1) match {
      case Array(a, b, c) => (a, (b, c)) 
    }
    //looks like these two map functions are the same, could refactor into a lambda or non member function
}

val joined = one.join(two)

val sum = joined.map {
     // num is the first value of the 1.txt pair, i.e. the second column of 1.txt
     case (_, ((num, _), (_, _))) => num.toInt
}.reduce(_ + _)
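The comment in the program above notes that the two map functions are identical and could be factored out. A minimal sketch of that refactoring follows, assuming a helper called parseLine (a hypothetical name, not part of Spark) and using plain Seq values as local stand-ins for the two text files so that it runs without a Spark context. The for-comprehension only emulates an inner join on local data; it is not how RDD.join is implemented.

```scala
object SharedParserDemo {
  // Hypothetical helper: turns "key,a,b" into (key, (a, b)).
  // Like the original pattern match, it throws a MatchError
  // on lines that do not have exactly three fields.
  def parseLine(line: String): (String, (String, String)) =
    line.split(",", -1) match {
      case Array(key, a, b) => (key, (a, b))
    }

  // Local stand-ins for the contents of 1.txt and 2.txt
  val one: Seq[(String, (String, String))] = Seq("1,9,5", "2,7,4", "3,8,3").map(parseLine)
  val two: Seq[(String, (String, String))] = Seq("1,g,h", "2,i,j", "3,k,l").map(parseLine)

  // Inner join on the key, emulated on plain collections
  val joined: Seq[(String, ((String, String), (String, String)))] =
    for {
      (k1, v1) <- one
      (k2, v2) <- two
      if k1 == k2
    } yield (k1, (v1, v2))

  def main(args: Array[String]): Unit =
    joined.foreach(println)
}
```

With Spark, the same helper could be passed straight to map, for example sc.textFile("1.txt").map(parseLine), so both RDDs share one parsing function.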
