
I am new to Hadoop, and I was recently asked to do a test project using Hadoop. While reading about Big Data, I happened to learn about Pail. What I want to do is this: first create a simple object, serialize it with Thrift, and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I don't know how to get that object in the map function.

Can someone point me to any references or explain how to do this?


1 Answer


I can think of three options:

  1. Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
  2. Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line (see the sketch after this list)
  3. Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
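For option 2, here's a minimal driver-side sketch, assuming the old-style org.apache.hadoop.filecache.DistributedCache API that was current at the time (newer releases expose Job.addCacheFile instead); the path and job name are placeholders:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

// inside your driver's main() or Tool.run() (which declares throws Exception):
Configuration conf = new Configuration();
Job job = new Job(conf, "my-job"); // placeholder job name
// register an HDFS file with the DistributedCache so the framework
// copies it to each task node before the mappers start
DistributedCache.addCacheFile(
        new URI("/path/to/file/in/hdfs/filename.dat"), job.getConfiguration());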

As for some code, put the load logic into your mapper's setup(...) or configure(...) method (depending on whether you're using the new or old API), as follows:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// inside your Mapper subclass (new API):
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // the -files option makes the named file available in the task's local directory
    File file = new File("filename.dat");
    // open file and load contents ...

    // or load the file directly from HDFS (FileSystem.open() takes a Path, not a String)
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from stream ...
}
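One caveat on option 1: the -files flag is consumed by GenericOptionsParser, so it only works if your driver is launched through ToolRunner. A rough sketch, where the class name, jar name, and job name are illustrative:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains whatever -files / -D options
        // GenericOptionsParser consumed from the command line
        Job job = new Job(getConf(), "my-job");
        // ... set mapper class, input/output paths, etc. ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // run as: hadoop jar myjob.jar MyDriver -files filename.dat <other args>
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}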

DistributedCache has some example code in its Javadocs.

answered 2012-06-08T02:19:47.900