apache-spark - How to make Tachyon share data between Spark jobs

Question

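I'm a beginner with Tachyon. I want to share some data or RDDs between Spark jobs. The Tachyon overview says:

Tachyon is an open source, memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster jobs.

But I don't know how to enable it. All I know is that Tachyon can act as an off-heap cache layer in Spark. Thanks.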

score 0 · Accepted Answer

I don't think you need to do anything explicitly; Alluxio (the current name of the Tachyon project) will manage the data sharing for you.

Assume you have two Spark jobs, A and B, and both are configured to read their data from Alluxio.
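To make "configured to read their data from Alluxio" concrete, here is a minimal PySpark sketch. The master address, the path `/data/events.txt`, and the function names are hypothetical, and it assumes the Alluxio client jar is already on the Spark driver and executor classpath so that `alluxio://` URIs resolve.

```python
# Hypothetical values: adjust the master address and the path to your cluster.
ALLUXIO_MASTER = "localhost:19998"  # 19998 is Alluxio's default master RPC port
SHARED_PATH = "/data/events.txt"    # assumed file visible in the Alluxio namespace


def alluxio_uri(path, master=ALLUXIO_MASTER):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    return "alluxio://" + master + "/" + path.lstrip("/")


def job_a(spark, path=SHARED_PATH):
    """Job A: count the lines of the shared file."""
    return spark.read.text(alluxio_uri(path)).count()


def job_b(spark, path=SHARED_PATH):
    """Job B (a separate application): count the non-empty lines of the same file."""
    lines = spark.read.text(alluxio_uri(path))
    return lines.filter(lines.value != "").count()


def main():
    # Imported lazily so the helpers above load even where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-sharing-sketch").getOrCreate()
    print(job_a(spark))
    print(job_b(spark))
    spark.stop()
```

Call `main()` from a script submitted with `spark-submit`. In practice A and B would be separate applications; they appear in one file here only to show that both point at the same `alluxio://` path, which is all the "configuration" amounts to on the Spark side.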

Assume there is no data in Alluxio yet, and jobs A and B are executed in a batch, with A first. When job A runs, Alluxio first fetches the data from the UFS (the under file system), serves it to the job, and caches it in its own storage, such as memory. When job B then asks for the data, Alluxio checks its own storage first and fetches from the UFS only on a cache miss. The data is thereby shared across the jobs.
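The read path described above can be illustrated with a toy model in plain Python. This is not Alluxio's API: `ReadThroughCache`, the dictionary standing in for the UFS, and the path are all made up for illustration.

```python
class ReadThroughCache:
    """Toy stand-in for Alluxio: serve reads from memory, fall back to the UFS."""

    def __init__(self, ufs):
        self.ufs = ufs        # dict standing in for the under file system
        self.memory = {}      # cached copies held by the cache layer
        self.ufs_reads = 0    # number of times the UFS had to be consulted

    def read(self, path):
        if path not in self.memory:         # cache miss: go to the UFS
            self.ufs_reads += 1
            self.memory[path] = self.ufs[path]
        return self.memory[path]            # cache hit, or freshly cached data


ufs = {"/data/events": ["a", "b", "c"]}
cache = ReadThroughCache(ufs)

job_a_result = len(cache.read("/data/events"))     # job A: miss, loaded from the UFS
job_b_result = sorted(cache.read("/data/events"))  # job B: hit, served from memory

print(cache.ufs_reads)  # prints 1: only job A touched the UFS
```

Job B never reaches the UFS, which is the sense in which the two jobs "share" the data.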

So in a nutshell, I think the data sharing here is really just the caching you mentioned.
