For example, the RNN is a dynamic 3-layer bidirectional LSTM with a hidden vector size of 200 (tf.nn.bidirectional_dynamic_rnn), and I have 4 GPUs to train the model. I saw a post that applied data parallelism to subsets of the samples within a batch, but that did not speed up training.
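For reference, the setup is roughly like this (a minimal sketch; n_inputs and the placeholder shape are illustrative assumptions, not my exact code):

import tensorflow as tf

n_neurons = 200   # hidden vector size
n_layers = 3
n_inputs = 50     # assumed input feature size

# Input batch: [batch_size, time_steps, features]
X = tf.placeholder(tf.float32, [None, None, n_inputs])

def lstm_cell():
    return tf.nn.rnn_cell.LSTMCell(num_units=n_neurons)

# Stack 3 LSTM layers per direction.
cell_fw = tf.nn.rnn_cell.MultiRNNCell([lstm_cell() for _ in range(n_layers)])
cell_bw = tf.nn.rnn_cell.MultiRNNCell([lstm_cell() for _ in range(n_layers)])
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, X,
                                                  dtype=tf.float32)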
1 Answer
You could also try model parallelism. One way is to write a cell wrapper like this, which creates the cell's ops on a specific device:
import tensorflow as tf

class DeviceCellWrapper(tf.nn.rnn_cell.RNNCell):
    """Delegates to the wrapped cell, but pins its ops to one device."""

    def __init__(self, cell, device):
        self._cell = cell
        self._device = device

    @property
    def state_size(self):
        return self._cell.state_size

    @property
    def output_size(self):
        return self._cell.output_size

    def __call__(self, inputs, state, scope=None):
        # Every op the wrapped cell creates is placed on the target device.
        with tf.device(self._device):
            return self._cell(inputs, state, scope)
Then place each individual cell on a dedicated GPU:
cell_fw = DeviceCellWrapper(
    cell=tf.nn.rnn_cell.LSTMCell(num_units=n_neurons, state_is_tuple=False),
    device='/gpu:0')
cell_bw = DeviceCellWrapper(
    cell=tf.nn.rnn_cell.LSTMCell(num_units=n_neurons, state_is_tuple=False),
    device='/gpu:1')  # backward cell on its own GPU

# X: input batch of shape [batch_size, time_steps, features]
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, X,
                                                  dtype=tf.float32)
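Since your network is 3 layers deep and you have 4 GPUs, you can combine the wrapper with MultiRNNCell to spread the stacked layers across devices. This is a sketch, not something I have benchmarked on your model, and the device split below is just one possible assignment:

def stacked_cells(devices):
    # One wrapped LSTM layer per entry in `devices`.
    cells = [DeviceCellWrapper(
                 cell=tf.nn.rnn_cell.LSTMCell(num_units=n_neurons,
                                              state_is_tuple=False),
                 device=dev)
             for dev in devices]
    return tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=False)

# Forward stack alternates over GPUs 0-1, backward over GPUs 2-3 (assumed split).
cell_fw = stacked_cells(['/gpu:0', '/gpu:1', '/gpu:0'])
cell_bw = stacked_cells(['/gpu:2', '/gpu:3', '/gpu:2'])
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, X,
                                                  dtype=tf.float32)

Keep in mind that RNN time steps are sequential, so the GPUs will spend part of each step waiting on each other; in my experience this kind of model parallelism tends to buy memory headroom more than a linear speedup.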
answered 2017-12-12T12:02:09