Can someone explain how the distributed cache works in Hadoop? I am running a job many times, and after each run I notice that the local distributed cache folder on each node keeps growing in size.
Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of an individual job?
The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", which leads me to believe that if the timestamp hasn't changed, the framework should not need to re-cache or re-copy the files to the nodes.
I am successfully adding files to the distributed cache using:
DistributedCache.addFileToClassPath(hdfsPath, conf);
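For reference, here is roughly how my driver is set up; this is a minimal sketch, and the HDFS path, class name, and job name below are placeholders, not my actual values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Register a file already on HDFS so it is copied to each
            // node's local cache and added to the task classpath.
            // The path here is a placeholder.
            Path hdfsPath = new Path("/libs/my-shared-lib.jar");
            DistributedCache.addFileToClassPath(hdfsPath, conf);

            // The cache entry must be registered on conf before the
            // Job is created, since Job copies the configuration.
            Job job = new Job(conf, "cache-example");
            // ... set mapper/reducer classes, input/output paths, etc.
            job.waitForCompletion(true);
        }
    }

I call addFileToClassPath before constructing the Job, on the assumption that the cache entry has to be in the configuration at submission time.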