map - [Scala/Scalding]: Mapping IDs to names

Question

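I'm fairly new to Scalding, and I'm trying to write a Scalding program that takes two datasets as input: 1) book_id_title: ('id, 'title): contains the mapping between book ID and book title, both of which are strings. 2) book_sim: ('id1, 'id2, 'sim): contains the similarity between pairs of books, identified by their IDs.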

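The goal of the Scalding program is to replace each (id1, id2) pair in book_sim with the corresponding titles by looking them up in the book_id_title table. However, I'm unable to retrieve the title. I'd appreciate it if someone could help with the getTitle() function below.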

My Scalding code is as follows:

  // read in the mapping between book id and title from a csv file
  val book_id_title =
       Csv(book_file, fields=book_format)
         .read
         .project('id,'title)

   // read in the similarity data from a csv file and map the ids to the titles
   // by calling getTitle function
  val result = 
      book_sim
      .map(('id1, 'id2)->('title1, 'title2)) {
           pair:(String,String)=> (getTitle(pair._1), getTitle(pair._2))
       }
      .write(out)


  // function that searches for the id and retrieves the title
  def getTitle(search_id: String) = {
      val btitle = 
         book_id_title
           .filter('id){id:String => id == search_id} // extract row matching the id
           .project('title)  // get the title
   }

Thanks

score 1 · Accepted Answer

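Hadoop is a batch processing system, so there is no way to look up data by index from inside a map. Instead, you need to join book_id_title and book_sim by id, probably twice: once for the left id and once for the right id. Something like: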

book_sim.joinWithSmaller('id1 -> 'id, book_id_title).joinWithSmaller('id2 -> 'id, book_id_title)

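I'm not very familiar with the fields-based API, so treat the above as pseudocode. You will also need to add the appropriate projections. Hopefully it still gives you the idea.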
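
For illustration, here is a minimal sketch of how the double join might look as a complete job in the fields-based API. It is not from the original answer. The class name BookTitleJoinJob and the argument names books, sims, and output are hypothetical, and the input files are assumed to be headerless CSVs with the column layouts described in the question. Because the first join leaves 'id and 'title in the pipe, the sketch renames 'title and discards 'id before joining again, so that the field names do not collide in the second join.

```scala
import com.twitter.scalding._

// Hypothetical job: replaces (id1, id2) in the similarity data with book titles.
class BookTitleJoinJob(args: Args) extends Job(args) {

  // ('id, 'title): mapping from book id to title (assumed headerless CSV)
  val bookIdTitle = Csv(args("books"), fields = ('id, 'title)).read

  // ('id1, 'id2, 'sim): similarity between pairs of books (assumed headerless CSV)
  val bookSim = Csv(args("sims"), fields = ('id1, 'id2, 'sim)).read

  bookSim
    // first join: resolve the title of the left-hand book
    .joinWithSmaller('id1 -> 'id, bookIdTitle)
    .rename('title -> 'title1)
    .discard('id) // drop the joined key so the next join can reuse the name 'id
    // second join: resolve the title of the right-hand book
    .joinWithSmaller('id2 -> 'id, bookIdTitle)
    .rename('title -> 'title2)
    .discard('id)
    // keep only the titles and the similarity score
    .project('title1, 'title2, 'sim)
    .write(Csv(args("output")))
}
```

Note that joinWithSmaller performs an inner join by default, so any similarity row whose id1 or id2 has no entry in the title table is dropped from the output.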
