I am trying to migrate our code from a Crunch MRPipeline to a SparkPipeline. I tried a simple example like this:
SparkConf sc = new SparkConf().setAppName("Crunch Spark Count").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(sc);
SparkPipeline p = new SparkPipeline(jsc, "Crunch Spark Count");
PCollection<String> lines = p.read(From.textFile(new Path(fileUrl)));
PCollection<String> words = lines.parallelDo(new Tokenizer(), Writables.strings());
PTable<String, Long> counts = words.count();
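For reference, the Tokenizer used above is a standard Crunch DoFn. The exact implementation is not shown in the question; a minimal sketch, assuming a simple whitespace split, would look like this:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Hypothetical Tokenizer (the real one is not shown in the question):
// splits each input line on whitespace and emits one word at a time.
public class Tokenizer extends DoFn<String, String> {
    @Override
    public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
            emitter.emit(word);
        }
    }
}
```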
My input files look like this:

file1: hello world hello hadoop
file2: hello spark
After running the Spark program, the output is always:
[hello, 1]
[hadoop, 1]
[world, 1]
[spark, 1]
The count for hello should actually be 3.
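To double-check the expected result, the same tokenize-and-count logic can be run in plain Java (no Crunch or Spark) on the input lines above; this is only a sanity check of the expected totals, not the pipeline itself:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountCheck {
    // Split each line on whitespace and count occurrences of every word.
    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // file1 and file2 contents from the question
        List<String> lines = Arrays.asList("hello world hello hadoop", "hello spark");
        System.out.println(countWords(lines));
    }
}
```

Running this confirms that hello appears 3 times across the two files, so the pipeline output of [hello, 1] is not what a correct count should produce.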
Is this a bug in Crunch's 'count' function?