The paper *TensorFlow: A System for Large-Scale Machine Learning*, §3.3, says:
We optimized TensorFlow for executing large subgraphs repeatedly with low latency. Once the graph for a step has been pruned, placed, and partitioned, its subgraphs are cached in their respective devices. A client session maintains the mapping from step definitions to cached subgraphs, so that a distributed step on a large graph can be initiated with one small message to each participating task. This model favours static, reusable graphs, but it can support dynamic computations using dynamic control flow, as the next subsection describes.
How should "cached in their respective devices" be understood here? Also, many APIs take a `caching_device` argument whose default value is None; how should this caching feature be understood?
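As far as client code is concerned, the subgraph caching is not something you configure; as I understand it, the runtime keys the pruned, placed, and partitioned subgraphs by the fetches/feeds signature of each `Session.run` call, so repeated steps reuse them. A minimal local sketch of the usage pattern this model favours:

```python
import tensorflow as tf

# A minimal sketch of the execution model described in the quote above.
# Pruning, placement, partitioning, and per-device caching of subgraphs
# all happen inside the runtime; the only thing visible from client code
# is that repeated Session.run calls with the same fetches/feeds reuse
# the cached subgraphs, so only the first step pays the setup cost.
x = tf.placeholder(tf.float32, shape=[None, 10], name="x")
w = tf.get_variable("w", shape=[10, 1], initializer=tf.ones_initializer())
y = tf.matmul(x, w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = [[1.0] * 10]
    # First run: the step's subgraphs are pruned, placed, partitioned and cached.
    sess.run(y, feed_dict={x: batch})
    # Later runs with the same fetches/feeds reuse the cached subgraphs.
    for _ in range(100):
        sess.run(y, feed_dict={x: batch})
```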
In general, a cache always needs an invalidation policy, so what is the cache-invalidation policy here?
If we use multiple cloned model graphs for multiple GPUs, with parallelism across the graphs (i.e., more model clones referencing the shared variables on the ps), how does each clone read the remote variables? Does it cache the variables on some local device by default to reduce network communication?
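On the per-variable side, my understanding is that nothing is cached across the network by default: the default `caching_device` is None, which means reads happen on the variable's own device (the ps). Setting `caching_device` moves the variable's read snapshot next to the consumers. A minimal sketch with hypothetical ps/worker device names:

```python
import tensorflow as tf

# A minimal sketch (TF 1.x, hypothetical device names) of caching a
# ps-hosted variable on the worker. Without a caching_device, each GPU's
# graph partition would recv `weights` over the network from the ps;
# with the cache on the worker CPU, the value crosses the network once
# per step and both GPUs read the local copy.
with tf.device("/job:ps/task:0/cpu:0"):
    weights = tf.get_variable(
        "weights", shape=[1024, 256],
        initializer=tf.zeros_initializer(),
        caching_device="/job:worker/task:0/cpu:0")

with tf.device("/job:worker/task:0/gpu:0"):
    y0 = tf.matmul(tf.random_normal([32, 1024]), weights)

with tf.device("/job:worker/task:0/gpu:1"):
    y1 = tf.matmul(tf.random_normal([32, 1024]), weights)
```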
More details:
A Tour of TensorFlow
https://arxiv.org/pdf/1610.01178.pdf
Finally, an important optimization made by TensorFlow at this step is “canonicalization” of (send,receive) pairs. In the setup displayed in Figure 5b, the existence of each recv node on device B would imply allocation and management of a separate buffer to store ν’s output tensor, so that it may then be fed to nodes α and β, respectively. However, an equivalent and more efficient transformation places only one recv node on device B, streams all output from ν to this single node, and then to the two dependent nodes α and β. This last and final evolution is given in Figure 5c.
The passage above describes how the optimization shown in Figure 5c automatically reduces the implicit read operations. When this happens in a distributed system, network traffic is reduced automatically as needed.
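A minimal sketch of the Figure 5b/5c situation (hypothetical device names): the single recv is inserted by the runtime's graph partitioner, so nothing changes in user code, but this is the kind of graph that triggers it:

```python
import tensorflow as tf

# One producer on "device A", two consumers on "device B". When the
# runtime partitions this graph, it canonicalizes the (send, recv) pairs:
# device B gets a single recv node for `v`, and both `alpha` and `beta`
# read from that one buffer instead of causing two separate transfers.
with tf.device("/job:worker/task:0/gpu:0"):   # device A
    v = tf.random_normal([4, 4], name="v")

with tf.device("/job:worker/task:1/gpu:0"):   # device B
    alpha = tf.reduce_sum(v, name="alpha")
    beta = tf.reduce_mean(v, name="beta")
```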
Taking another approach, /model/slim/deployment/model_deploy.py tries to create cached variables, as shown below:
```python
def caching_device(self):
  """Returns the device to use for caching variables.

  Variables are cached on the worker CPU when using replicas.

  Returns:
    A device string or None if the variables do not need to be cached.
  """
  if self._num_ps_tasks > 0:
    return lambda op: op.device
  else:
    return None
```
An attempt to optimize network traffic, I think.
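As I read it, with `_num_ps_tasks > 0` the callable `lambda op: op.device` leaves each variable's read snapshot on whatever device the surrounding clone scope assigns it (the worker) instead of colocating the read with the variable on the ps, so the worker holds a local per-step copy. A minimal sketch of how such a value could be attached through a variable scope (hypothetical wiring, not taken from model_deploy itself):

```python
import tensorflow as tf

# A minimal sketch, not the actual model_deploy wiring: attach a callable
# caching_device through a variable scope so that every variable created
# inside it is read through a snapshot placed on the consuming device.
def build_clone(caching_device):
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE,
                           caching_device=caching_device):
        w = tf.get_variable("w", shape=[128, 64],
                            initializer=tf.zeros_initializer())
        x = tf.random_normal([8, 128])
        return tf.matmul(x, w)

# With ps tasks, caching_device() above returns `lambda op: op.device`:
# the read op keeps the device the clone scope would give it anyway,
# i.e. the worker, which then acts as the cache for the remote variable.
logits = build_clone(lambda op: op.device)
```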
What is the right, or best, way to optimize communication in a distributed system?
We would also like a clearer explanation of this. If I get more results from tuning experiments, we will try to update this issue.