I'm trying to write a Spark application that outputs the number of words starting with each letter. I'm getting a string index out of range error. Any suggestions, or am I just not approaching this map-reduce problem the right way?

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class Main {
    public static void main(String[] args) throws Exception {

        // Tell Spark to run against a local cluster
        SparkConf conf = new SparkConf().setAppName("App").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // MARK: Mapping
        // Read the target file into a Resilient Distributed Dataset (RDD)
        JavaRDD<String> lines = sc.textFile("pg100.txt");
        System.out.printf("%d lines\n", lines.count());

        // Split each line into individual words
        // Treat all words as lowercase; ignore non-alphabetic characters
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "").replaceAll("\\.", "").toLowerCase());

        // MARK: Pairing
        // Emit a (first letter, 1) pair for each word
        JavaPairRDD<Character, Integer> letters = words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));

        // MARK: Reducing
        // Sum the pairs to get the number of words starting with each letter
        JavaPairRDD<Character, Integer> counts = letters.reduceByKey((n1, n2) -> n1 + n2);

        counts.saveAsTextFile("result");
        sc.stop();

    }
}

1 Answer

I suspect that some of the words consist only of characters that are stripped out by this line:

JavaRDD<String> words = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "").replaceAll("\\.", "").toLowerCase());

As a result, some words end up as empty strings but still remain in the words RDD, and when you try to access index 0 on one of them you naturally get the exception you mentioned.

You might assume that if map produces an empty string, the string simply won't be included in words; that is not the case.
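
To see the failure in isolation, here is a minimal sketch (the token "..." is a hypothetical example; any token made up entirely of stripped characters behaves the same way):

public class EmptyTokenDemo {
    public static void main(String[] args) {
        // A token consisting only of characters the regex strips out
        String token = "...";
        String cleaned = token.replaceAll("[^a-zA-Z0-9_-]", "").toLowerCase();
        System.out.println(cleaned.isEmpty());   // prints: true
        char first = cleaned.charAt(0);          // throws StringIndexOutOfBoundsException
        System.out.println(first);               // never reached
    }
}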

UPD. You can filter out the empty strings like this (note that filter returns a new RDD rather than modifying words in place, so the result has to be assigned and used downstream):

words = words.filter(word -> !word.isEmpty());
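
Applied to the pipeline in the question, the mapping stage then looks something like this (a sketch; it keeps the question's regex, drops the redundant second replaceAll since the first one already removes periods, and appends the filter before the pairing step):

JavaRDD<String> words = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "").toLowerCase())
        .filter(word -> !word.isEmpty());   // drop tokens that were reduced to ""

// charAt(0) is now safe: every remaining word has at least one character
JavaPairRDD<Character, Integer> letters =
        words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));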
answered Sep 18, 2019 at 2:00