hadoop - Re-use files in Hadoop Distributed cache

Question

I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node is growing in size.

Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of any individual job?

The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", which leads me to believe that if the timestamp hasn't changed, the files should not need to be re-cached or re-copied to the nodes.

I am adding files successfully to the distributed cache using:

DistributedCache.addFileToClassPath(hdfsPath, conf);

score 2 · Accepted Answer

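DistributedCache uses reference counting to manage the cache. org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread is responsible for cleaning up the CacheDirs whose reference count is 0. It checks once a minute (the default period is 1 minute; you can set it via "mapreduce.tasktracker.distributedcache.checkperiod").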

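When a Job finishes or fails, the JobTracker sends an org.apache.hadoop.mapred.KillJobAction to the TaskTrackers. When a TaskTracker receives a KillJobAction, it puts the action into tasksToCleanup. The TaskTracker has a background thread named taskCleanupThread that takes actions from tasksToCleanup and does the cleanup work. For a KillJobAction, it calls purgeJob to clean up the job. In that method, it decrements the reference counts held by this Job (rjob.distCacheMgr.release();).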
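The lifecycle described above (acquire at job start, release in purgeJob, periodic removal of zero-count entries) can be modeled with a small sketch. This is not Hadoop code: the class RefCountedCacheSketch and its methods are hypothetical names chosen for illustration, and the cache key is assumed to be a plain string.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for the TaskTracker's cache bookkeeping; not Hadoop code.
class RefCountedCacheSketch {
    // Cache key (assumed here to be a plain string) -> number of jobs using it.
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Called when a job is localized. Returns true if the entry was already
    // cached, i.e. no new copy from HDFS would be needed.
    boolean acquire(String key) {
        boolean alreadyCached = refCounts.containsKey(key);
        refCounts.merge(key, 1, Integer::sum);
        return alreadyCached;
    }

    // Called when a job finishes or is killed (the purgeJob step).
    void release(String key) {
        refCounts.computeIfPresent(key, (k, n) -> Math.max(0, n - 1));
    }

    // One pass of the periodic cleanup: drop every entry with a count of 0.
    // Returns how many entries were removed.
    int cleanupPass() {
        int removed = 0;
        Iterator<Map.Entry<String, Integer>> it = refCounts.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() == 0) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    boolean isCached(String key) {
        return refCounts.containsKey(key);
    }
}
```

In this model, a second job reuses the file only if it calls acquire before the next cleanupPass runs, which is why a longer check period raises the chance of reuse.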

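The analysis above is based on hadoop-core-2.0.0-mr1-cdh4.2.1-sources.jar. I also checked hadoop-core-0.20.2-cdh3u1-sources.jar and found some subtle differences between the two versions. For example, there is no org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread in 0.20.2-cdh3u1. Instead, when a Job is initialized, TrackerDistributedCacheManager checks whether there is enough space for this Job's new cache files. If there is not, it deletes the caches whose reference count is 0.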
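The cdh3u1 behavior described above, where unreferenced entries are deleted only when a new job needs the space, can be sketched in the same spirit. Again this is not Hadoop code: SizeBoundedCacheSketch, its fields, and the choice to evict every unreferenced entry at once are assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of space-triggered eviction; not Hadoop code.
class SizeBoundedCacheSketch {
    private static final class Entry {
        final long size;
        int refCount;
        Entry(long size) { this.size = size; }
    }

    private final long limitBytes;   // plays the role of "local.cache.size"
    private long usedBytes = 0;
    private final Map<String, Entry> entries = new LinkedHashMap<>();

    SizeBoundedCacheSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Job initialization: reuse an existing entry, or make room and add one.
    // Returns true if the entry was already cached.
    boolean acquire(String key, long size) {
        Entry e = entries.get(key);
        if (e != null) {
            e.refCount++;
            return true;
        }
        if (usedBytes + size > limitBytes) {
            evictUnreferenced();   // assumption: evict all zero-count entries
        }
        e = new Entry(size);
        e.refCount = 1;
        entries.put(key, e);
        usedBytes += size;
        return false;
    }

    // Job completion: drop this job's reference but keep the file on disk.
    void release(String key) {
        Entry e = entries.get(key);
        if (e != null && e.refCount > 0) {
            e.refCount--;
        }
    }

    private void evictUnreferenced() {
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (e.refCount == 0) {
                usedBytes -= e.size;
                it.remove();
            }
        }
    }

    boolean isCached(String key) { return entries.containsKey(key); }

    long usedBytes() { return usedBytes; }
}
```

Under this model, a released file stays available to later jobs until some other job's files push the cache over its limit, so a larger limit keeps files reusable for longer.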

If you are using cdh4.2.1, you can increase "mapreduce.tasktracker.distributedcache.checkperiod" to delay the cleanup work. That raises the probability that multiple Jobs will use the same distributed cache.

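If you are using cdh3u1, you can increase the cache size limit ("local.cache.size", 10G by default) and the maximum number of cache directories ("mapreduce.tasktracker.cache.local.numberdirectories", 10000 by default). This can also be applied to cdh4.2.1.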

score 0 · Accepted Answer

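If you read the book carefully, there is a limit on how much can be stored in the distributed cache. The default is 10GB (configurable). Multiple different jobs can run in the cluster at the same time. Hadoop can only guarantee that a file stays available in the cache for the duration of a single job, because availability is maintained through reference counting done by the tasktracker for the tasks that access files in the cache. In your case, the files may no longer be there for subsequent jobs because they have already been marked for deletion.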

Please correct me if you disagree with any of this. I would be happy to discuss it further.

score 0 · Accepted Answer

According to this: http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/

You should be able to do this through the DistributedCache API rather than with "-libjars".
