So far I have a JavaDStream that initially looks like this:
Value
---------------------
a,apple,spain
b,orange,italy
c,apple,italy
a,apple,italy
a,orange,greece
First, I split each row and mapped it to a key-value pair in a JavaPairDStream:
JavaPairDStream<String, String> pairDStream = inputStream.mapToPair(row -> {
    String[] cols = row.split(",");
    String key = cols[0];
    String value = cols[1] + "," + cols[2];
    return new Tuple2<>(key, value);
});
That gives me this:
Key | Value
---------------------
a | apple,spain
b | orange,italy
c | apple,italy
a | apple,italy
a | orange,greece
In the end, the output should look like this:
Key | Fruit | Country
-------------------------------
a | 2 | 3
b | 1 | 1
c | 1 | 1
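It counts the number of distinct fruits and distinct countries for each key.
What is the best way to proceed from here? Should I call groupByKey/reduceByKey first and then split the value again? Or is it possible to keep two values per key in a pair, like this: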
Key | Value1 | Value2
----------------------
a | apple | spain
b | orange | italy
c | apple | italy
a | apple | italy
a | orange | greece
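
To make the goal concrete, here is a minimal plain-Java sketch of the aggregation I am after. It uses no Spark at all. The class DistinctCounts, the accumulator Acc, and its add/merge methods are hypothetical names I made up for illustration. The idea is to keep one set per value column for each key and take the set sizes at the end. I assume the same two functions could be passed to something like aggregateByKey on each batch RDD, but that wiring is not shown and I have not verified it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; all names here are illustrative only.
class DistinctCounts {

    // Per-key accumulator: one set of distinct values per column.
    static final class Acc {
        final Set<String> fruits = new HashSet<>();
        final Set<String> countries = new HashSet<>();

        // Fold a single (fruit, country) record into this accumulator.
        Acc add(String fruit, String country) {
            fruits.add(fruit);
            countries.add(country);
            return this;
        }

        // Merge another partial accumulator for the same key into this one.
        Acc merge(Acc other) {
            fruits.addAll(other.fruits);
            countries.addAll(other.countries);
            return this;
        }
    }

    // Returns key -> {distinct fruit count, distinct country count}.
    static Map<String, int[]> countDistinct(List<String> rows) {
        Map<String, Acc> accs = new LinkedHashMap<>();
        for (String row : rows) {
            String[] cols = row.split(",");
            accs.computeIfAbsent(cols[0], k -> new Acc()).add(cols[1], cols[2]);
        }
        Map<String, int[]> result = new LinkedHashMap<>();
        accs.forEach((key, acc) ->
            result.put(key, new int[] {acc.fruits.size(), acc.countries.size()}));
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "a,apple,spain", "b,orange,italy", "c,apple,italy",
            "a,apple,italy", "a,orange,greece");
        // Prints one "key | fruits | countries" line per key.
        countDistinct(rows).forEach((key, counts) ->
            System.out.println(key + " | " + counts[0] + " | " + counts[1]));
    }
}
```

Keeping the two columns in separate sets avoids re-splitting a concatenated "fruit,country" string after grouping, which is the part of my current approach that feels awkward.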