I'm trying to write a Spark application that outputs the number of words that start with each letter. I'm getting a StringIndexOutOfBoundsException. Any suggestions, or am I not approaching this map-reduce problem the right way?
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class Main {
    public static void main(String[] args) throws Exception {
        // Run Spark locally
        SparkConf conf = new SparkConf().setAppName("App").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        System.out.printf("%d lines\n", sc.textFile("pg100.txt").count());

        // MARK: Mapping
        // Read the target file into a Resilient Distributed Dataset (RDD)
        JavaRDD<String> lines = sc.textFile("pg100.txt");

        // Split each line into individual words, treat all words as lowercase,
        // and strip non-alphanumeric characters
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "").replaceAll("\\.", "").toLowerCase());

        // MARK: Mapping to pairs
        // Map each word to a (first letter, 1) pair
        JavaPairRDD<Character, Integer> letters = words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));

        // MARK: Reducing
        // Sum the counts for each starting letter
        JavaPairRDD<Character, Integer> counts = letters.reduceByKey((n1, n2) -> n1 + n2);

        counts.saveAsTextFile("result");
        sc.stop();
    }
}
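For context, the exception most likely originates at w.charAt(0) in the mapToPair step: once replaceAll strips the non-matching characters, any token that consisted only of punctuation (or the empty tokens produced by consecutive spaces in split) becomes an empty string, and calling charAt(0) on an empty string throws StringIndexOutOfBoundsException. Below is a minimal sketch of one way to guard against that by filtering out empty tokens before building the pairs; the filter step and the nonEmptyWords variable are my additions, not part of the original code.

        // Assumption: empty strings are the cause of the exception.
        // Drop them before taking the first character; without this guard,
        // tokens that were pure punctuation become "" and charAt(0) throws.
        JavaRDD<String> nonEmptyWords = words.filter(w -> !w.isEmpty());

        JavaPairRDD<Character, Integer> letters =
                nonEmptyWords.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));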