r - Can I use readLines inside an RHadoop mapreduce job?

Question

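I'm trying to read a text or gz file from HDFS and run a simple mapreduce job (really just a map step), but I get an error, and it looks like the readLines part isn't working. I'd like to know whether readLines can be used inside mapreduce. P.S. If I use readLines outside the mapreduce job to parse the HDFS file, there is no problem. Thanks.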

counts <- function(path){
  ct.map <- function(., lines) {
    line <- readLines(lines)
    word <- unlist(strsplit(line, pattern = " "))
    keyval(word, 1)
  }

  mapreduce(
    input = path,
    input.format = "text",
    map = ct.map
  )
}
counts("/user/ychen/100.txt")

score 0 · Accepted Answer

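Not like that. The map function needs its input to arrive in dfs format. You can rewrite your function like this, doing the formatting in the input step: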

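library(rmr2)

counts <- function(path){
  # Map: split each line into words and emit (word, 1) pairs.
  ct.map <- function(., line) {
    word <- unlist(strsplit(line, split = " "))
    keyval(word, 1)
  }

  mapreduce(
    # readLines() runs on the client before the job starts;
    # to.dfs() then writes the lines out in dfs format.
    input = to.dfs(readLines(path)),
    map = function(k, v){ct.map(k, v)},
    # Reduce: count how many 1s were emitted for each word.
    reduce = function(k, v){keyval(k, length(v))}
  )
}
output <- from.dfs(counts("/user/ychen/100.txt"))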

I also added a reduce step to sum the values (every emitted value is 1, so length(v) equals the sum).
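Another option is to let rmr2 parse the file itself: with input.format = "text", the second argument of the map function is already a character vector of lines, so readLines is not needed inside the map. The following is only a minimal sketch, assuming rmr2 is installed. It uses rmr2's local backend and a hypothetical temporary sample file (input_path) so it can be tried without a cluster; ct.reduce and word_counts are illustrative names.

```r
library(rmr2)

# Assumption: run on rmr2's local backend so no Hadoop cluster is needed.
rmr.options(backend = "local")

# Hypothetical sample input, written to a temporary local file.
input_path <- tempfile(fileext = ".txt")
writeLines(c("a b a", "b c"), input_path)

counts <- function(path) {
  # With input.format = "text", `lines` is already a character vector
  # of lines, so no readLines() call is needed here.
  ct.map <- function(., lines) {
    keyval(unlist(strsplit(lines, split = " ")), 1)
  }
  # Sum the 1s emitted for each word.
  ct.reduce <- function(word, ones) {
    keyval(word, sum(ones))
  }
  mapreduce(
    input = path,
    input.format = "text",
    map = ct.map,
    reduce = ct.reduce
  )
}

result <- from.dfs(counts(input_path))
# Named vector: names are the words, values are their counts.
word_counts <- setNames(values(result), keys(result))
```

On a real cluster the same counts() function would be called with the HDFS path instead of input_path, and the rmr.options line would be dropped.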
