I have been looking into the Lambda Architecture as implemented with Apache Spark. From the article below, the way to merge the batch and real-time views is something like "realTimeView.unionAll(batchView).groupBy(...)". But won't this approach run into performance problems when the data behind batchView is large?
For example, if batchView holds 100,000,000 rows, then Spark has to run the groupBy over all 100,000,000 rows every time a client requests the merged view, which is obviously slow.
https://dzone.com/articles/lambda-architecture-with-apache-spark
// Spark 1.x DataFrame API, quoted from the article.
DataFrame realTimeView = streamingService.getRealTimeView();
DataFrame batchView = servingService.getBatchView();

// Runs on every client request: unions the raw rows of both views and
// re-aggregates everything, including the full batch view.
DataFrame mergedView = realTimeView.unionAll(batchView)
        .groupBy(realTimeView.col(HASH_TAG.getValue()))
        .sum(COUNT.getValue())
        .orderBy(HASH_TAG.getValue());

List<Row> merged = mergedView.collectAsList();
return merged.stream()
        .map(row -> new HashTagCount(row.getString(0), row.getLong(1)))
        .collect(Collectors.toList());
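To make the question concrete, here is the kind of pre-aggregation workaround I have in mind (just a rough sketch; batchAggregates and the alias handling are my own, not from the article). Since the per-hashtag sums are associative, the serving layer could reduce the batch view to one partial sum per hashtag once per batch run and cache that small result, so each client request only has to aggregate the small real-time view and merge it with the cached partial sums:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

// Done once per batch-layer run (NOT per request): collapse the 100M raw
// rows into one partial sum per hashtag and cache the much smaller result.
DataFrame batchAggregates = servingService.getBatchView()
        .groupBy(col(HASH_TAG.getValue()))
        .agg(sum(col(COUNT.getValue())).alias(COUNT.getValue()))
        .cache();

// Done per client request: only the small real-time view is aggregated
// from scratch; the batch side contributes pre-computed partial sums.
DataFrame mergedView = streamingService.getRealTimeView()
        .groupBy(col(HASH_TAG.getValue()))
        .agg(sum(col(COUNT.getValue())).alias(COUNT.getValue()))
        .unionAll(batchAggregates)
        .groupBy(col(HASH_TAG.getValue()))
        .agg(sum(col(COUNT.getValue())).alias(COUNT.getValue()))
        .orderBy(col(HASH_TAG.getValue()));

Is something along these lines the expected way to use the serving layer here, or does the article's approach already avoid the full 100,000,000-row groupBy per request in some way I am missing?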