Hi, I'm trying to run a map-reduce job over git repositories. I want to use a map job to first clone all the repositories into HDFS in parallel, and then run further map-reduce jobs over the files. I've run into a problem: I'm not sure how to write the repository files to HDFS. I've seen examples that write a single file, but those ran outside the mapper and only wrote one file. The JGit API only exposes a repository structure that inherits from File, whereas HDFS uses Paths that are written to as DataOutputStreams. Is there a good way to convert between the two, or any similar examples? A rough sketch of the kind of bridge I have in mind is below.
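To make it concrete, here is what I imagine so far: clone to a local temp directory with JGit, then copy the clone into HDFS with the Hadoop FileSystem API. The class name, method name, and paths are placeholders I made up, and I'm not sure this is the idiomatic way to do it:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.eclipse.jgit.api.Git;

// Rough sketch only; class/method names and paths are placeholders.
public class RepoToHdfs {
    public static void cloneToHdfs(String repoUri, String hdfsDir) throws Exception {
        // JGit can only clone to a local java.io.File, not to an HDFS Path.
        File localDir = java.nio.file.Files.createTempDirectory("repo-clone").toFile();

        try (Git git = Git.cloneRepository()
                .setURI(repoUri)
                .setDirectory(localDir)
                .call()) {
            // Bridge to HDFS: copy the local clone (recursively) into the cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path(localDir.getAbsolutePath()), new Path(hdfsDir));
        }
    }
}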
Thanks
The input data to a Hadoop mapper must be on HDFS, not on your local machine or anywhere else. Map-reduce jobs are not meant for migrating data from one place to another; they are used to process huge volumes of data already present on HDFS. I am sure your repository data is not on HDFS, and if it were, you wouldn't have needed to perform any operation in the first place. So please keep in mind that map-reduce jobs are for processing large volumes of data that already reside on HDFS (the Hadoop file system).
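If your repositories currently live on a local disk, stage them onto HDFS first, before submitting any job. A minimal sketch of streaming one local file into HDFS with the FileSystem API follows; the class name and paths are placeholders:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch: stage a single local file onto HDFS before the job runs.
// The paths below are placeholders.
public class StageFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (InputStream in = new FileInputStream("/local/data/file.txt");
             FSDataOutputStream out = fs.create(new Path("/user/me/input/file.txt"))) {
            // Stream the local bytes into the HDFS output stream;
            // 'false' leaves closing to the try-with-resources block.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}

For whole directories you can achieve the same thing from the command line with hadoop fs -put, which is usually simpler than writing code.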