I have a set of numbers, such as 1, 4, 10, 23, ..., and I would like to build a B-tree index for them using Apache Spark. The input format is one record per line (separated by '\n'). I also have no idea what format the output file should take; I would just like a recommended one.
The regular way of building a B-tree index is shown at https://en.wikipedia.org/wiki/B-tree, but I would now like a distributed, parallel version in Apache Spark.
In addition, the Wikipedia article on B-trees describes a way to build a B-tree to represent a large existing collection of data (see https://en.wikipedia.org/wiki/B-tree). It seems that I should sort the data in advance, and I think that for a big data set, sorting is quite time-consuming and may not even complete with limited memory. Is the method mentioned above a recommended one?
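
To make the question concrete, here is how I understand the sort-then-build idea, as a minimal single-machine sketch in plain Python (not Spark). It is a simplified B+-tree-style variant rather than the exact B-tree procedure from the article: the sorted keys are cut into leaf nodes, and each upper level stores the first key of every child node. The names `bulk_load`, `contains`, and `fanout` are my own, and the nested-list layout is only an assumption, not a recommended output format.

```python
from bisect import bisect_right
from typing import List

Node = List[int]
Level = List[Node]


def bulk_load(sorted_keys: List[int], fanout: int = 4) -> List[Level]:
    """Build index levels bottom-up from keys that are ALREADY sorted.

    Returns the levels leaves-first, root-last. Each level is a list of
    nodes and each node is a list of at most `fanout` keys.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    # Leaf level: consecutive chunks of the sorted input.
    level: Level = [sorted_keys[i:i + fanout]
                    for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    # Each parent level keeps the first key of every node below it.
    while len(level) > 1:
        firsts = [node[0] for node in level]
        level = [firsts[i:i + fanout] for i in range(0, len(firsts), fanout)]
        levels.append(level)
    return levels


def contains(levels: List[Level], key: int, fanout: int = 4) -> bool:
    """Walk from the root down to a leaf and check whether `key` is there."""
    if not levels[0]:
        return False
    idx = 0  # position of the current node within its level
    for depth in range(len(levels) - 1, 0, -1):
        node = levels[depth][idx]
        # Pick the child whose first key is the largest one <= key.
        pos = max(bisect_right(node, key) - 1, 0)
        idx = idx * fanout + pos
    return key in levels[0][idx]


keys = sorted([23, 1, 10, 4, 42, 57, 63, 70, 88])
index = bulk_load(keys, fanout=3)
print(index[0])   # leaf level
print(index[-1])  # root level
print(contains(index, 23, fanout=3), contains(index, 5, fanout=3))
```

The build step itself is a single pass over the sorted keys, so the part I am unsure about is the sort, which in Spark would presumably have to be a distributed sort.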