I'm trying to write a Spark application that outputs the number of words starting with each letter. I'm getting a string index out of range error. Any suggestions, or am I just not approaching this map-reduce problem the right way?

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class Main {
    public static void main(String[] args) throws Exception {

        // Tell Spark to run against a local cluster
        SparkConf conf = new SparkConf().setAppName("App").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // MARK: Mapping
        // Read the target file into a Resilient Distributed Dataset (RDD)
        JavaRDD<String> lines = sc.textFile("pg100.txt");
        System.out.printf("%d lines\n", lines.count());

        // Split each line into individual words
        // Treat all words as lowercase; ignore non-alphabetic characters
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "").replaceAll("\\.", "").toLowerCase());

        // MARK: Pairing
        // Emit a (first letter, 1) pair for each word
        JavaPairRDD<Character, Integer> letters = words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));

        // MARK: Reducing
        // Sum the pairs to get the number of words starting with each letter
        JavaPairRDD<Character, Integer> counts = letters.reduceByKey((n1, n2) -> n1 + n2);

        counts.saveAsTextFile("result");
        sc.stop();

    }
}

1 Answer

I suspect that some of the words consist only of characters that are stripped out by this line:

JavaRDD<String> words = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "").replaceAll("\\.", "").toLowerCase());

As a result, some words end up as empty strings but still remain in the words RDD, and when you try to access index 0 on one of them you naturally get the exception you mentioned.

You might assume that if map produces an empty string, the string simply won't be included in words; that is not the case.
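
To see the failure in isolation, here is a minimal sketch (the token "..." is a hypothetical example; any token made up entirely of stripped characters behaves the same way):

public class EmptyTokenDemo {
    public static void main(String[] args) {
        // A token consisting only of characters the regex strips out
        String token = "...";
        String cleaned = token.replaceAll("[^a-zA-Z0-9_-]", "").toLowerCase();
        System.out.println(cleaned.isEmpty());   // prints: true
        char first = cleaned.charAt(0);          // throws StringIndexOutOfBoundsException
        System.out.println(first);               // never reached
    }
}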

UPD. You can filter out the empty strings like this (note that filter returns a new RDD rather than modifying words in place, so the result has to be assigned and used downstream):

words = words.filter(word -> !word.isEmpty());
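
Applied to the pipeline in the question, the mapping stage then looks something like this (a sketch; it keeps the question's regex, drops the redundant second replaceAll since the first one already removes periods, and appends the filter before the pairing step):

JavaRDD<String> words = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "").toLowerCase())
        .filter(word -> !word.isEmpty());   // drop tokens that were reduced to ""

// charAt(0) is now safe: every remaining word has at least one character
JavaPairRDD<Character, Integer> letters =
        words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));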
answered Sep 18, 2019 at 2:00