
I have a set of numbers, such as 1, 4, 10, 23, ..., and I would like to build a B-tree index for them using Apache Spark. The input has one record per line (records are separated by '\n'). I also have no idea what format the output file should use, so I am looking for a recommended one.

The usual way of building a B-tree index is described at https://en.wikipedia.org/wiki/B-tree, but I would like a distributed, parallel version in Apache Spark.

In addition, the Wikipedia article on B-trees describes a way to build a B-tree from a large existing collection of data (see https://en.wikipedia.org/wiki/B-tree). It seems that I would have to sort the data in advance, and I think that for a big data set sorting is quite time-consuming and may not even complete with limited memory. Is the method mentioned above a recommended one?


1 Answer


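Sort the RDD with RDD.sort if it is not already sorted. Use RDD.mapPartitions to build an index for each partition. Then build a top-level index that links the per-partition indexes.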
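The three steps can be sketched roughly as follows. This is a minimal illustration in plain Python, not a definitive implementation: the partitions are simulated with ordinary lists so that it runs without a cluster, and every name in it (`bulk_load`, `build_partition_index`, `build_top_level`, `contains`, `range_partition`, `FANOUT`) is hypothetical. The per-partition index is assumed to be a static, bottom-up bulk-loaded tree whose upper levels hold the smallest key of each child, which is one simple way to index already-sorted keys and not the only possible layout.

```python
from bisect import bisect_right

FANOUT = 4  # hypothetical maximum number of keys per node; real indexes use far more


def bulk_load(sorted_keys, fanout=FANOUT):
    """Build a static B-tree-like index bottom-up from already sorted keys.

    Returns a list of levels, leaves first. Each level is a list of nodes and
    each node is a list of keys. Upper nodes hold the smallest key of each child.
    """
    if not sorted_keys:
        return []
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    while len(level) > 1:
        separators = [node[0] for node in level]
        level = [separators[i:i + fanout] for i in range(0, len(separators), fanout)]
        levels.append(level)
    return levels


def search(levels, key, fanout=FANOUT):
    """Descend from the root level to a leaf and report whether key is present."""
    if not levels:
        return False
    idx = 0
    for depth in range(len(levels) - 1, 0, -1):
        node = levels[depth][idx]
        pos = bisect_right(node, key) - 1
        if pos < 0:
            return False  # key is smaller than every key in this subtree
        idx = idx * fanout + pos  # child position in the level below
    leaf = levels[0][idx]
    pos = bisect_right(leaf, key) - 1
    return pos >= 0 and leaf[pos] == key


def build_partition_index(partition_id, records):
    """Index one sorted partition; shaped as f(index, iterator) -> iterator."""
    keys = list(records)
    if keys:  # empty partitions contribute nothing to the top-level index
        yield (partition_id, keys[0], bulk_load(keys))


def build_top_level(partition_indexes):
    """Link the per-partition indexes through their smallest keys."""
    entries = sorted(partition_indexes, key=lambda entry: entry[1])
    return [entry[1] for entry in entries], [entry[2] for entry in entries]


def contains(top, key):
    """Pick the partition via the top-level index, then search inside it."""
    min_keys, trees = top
    p = bisect_right(min_keys, key) - 1
    return p >= 0 and search(trees[p], key)


def range_partition(sorted_keys, num_partitions):
    """Local stand-in for the range partitions produced by a distributed sort."""
    size = max(1, -(-len(sorted_keys) // num_partitions))
    return [sorted_keys[i:i + size] for i in range(0, len(sorted_keys), size)]


numbers = [23, 1, 10, 4, 57, 42, 99, 61, 80]
partitions = range_partition(sorted(numbers), 3)
per_partition = [entry
                 for pid, part in enumerate(partitions)
                 for entry in build_partition_index(pid, iter(part))]
top = build_top_level(per_partition)
print(contains(top, 23), contains(top, 5))  # prints: True False
```

With PySpark, the local simulation would presumably be replaced by something like `rdd.sortBy(lambda x: x).mapPartitionsWithIndex(build_partition_index).collect()`, followed by `build_top_level` on the driver. That wiring is an assumption and is not exercised by the sketch above. It also collects every per-partition tree to the driver, which only makes sense for small data; for a large data set each partition's index would more likely be written to its own output file, with only the `(partition_id, smallest key)` pairs collected for the top-level index.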

Answered 2015-03-07T10:13:21.467