
I've been researching the lambda architecture implemented with Spark. From the article below, the way to merge the batch and real-time views is "realTimeView.unionAll(batchView).groupBy...", but when the data behind batchView is large, won't this approach run into performance problems?

For example, if batchView contains 100,000,000 rows, then Spark has to groupBy all 100,000,000 rows every time a client requests the merged view, which is obviously slow.

https://dzone.com/articles/lambda-architecture-with-apache-spark

DataFrame realTimeView = streamingService.getRealTimeView();
DataFrame batchView = servingService.getBatchView();
DataFrame mergedView = realTimeView.unionAll(batchView)
        .groupBy(realTimeView.col(HASH_TAG.getValue()))
        .sum(COUNT.getValue())
        .orderBy(HASH_TAG.getValue());
List<Row> merged = mergedView.collectAsList();
return merged.stream()
        .map(row -> new HashTagCount(row.getString(0), row.getLong(1)))
        .collect(Collectors.toList());
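
One idea I had to avoid re-grouping the full batch data on every request (a sketch of my own, not from the article) is to aggregate the batch view once per batch cycle and cache the result, so each client request only aggregates the small real-time view and merges two already-aggregated DataFrames. The variable names batchAggregates and realTimeAggregates below are hypothetical:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.DataFrame;

// Pre-aggregate the 100,000,000-row batch view once per batch cycle and
// cache the (much smaller) per-hashtag totals in memory.
DataFrame batchAggregates = batchView
        .groupBy(col(HASH_TAG.getValue()))
        .agg(sum(col(COUNT.getValue())).as(COUNT.getValue()))
        .cache();

// On each client request, aggregate only the small real-time view ...
DataFrame realTimeAggregates = realTimeView
        .groupBy(col(HASH_TAG.getValue()))
        .agg(sum(col(COUNT.getValue())).as(COUNT.getValue()));

// ... and merge two already-aggregated DataFrames instead of grouping
// all 100,000,000 raw batch rows again.
DataFrame mergedView = realTimeAggregates.unionAll(batchAggregates)
        .groupBy(col(HASH_TAG.getValue()))
        .agg(sum(col(COUNT.getValue())).as(COUNT.getValue()))
        .orderBy(HASH_TAG.getValue());

With this shape, the per-request groupBy only touches the real-time rows plus one cached aggregate row per hashtag, so the cost no longer scales with the raw batch size. Is that the standard way to do the merge, or is there a better approach?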
