1

我正在并行训练两个子模型,并希望在它们都完成后对它们进行平均。

该模型是用 tensor2tensor 实现的(但仍然使用 tensorflow)。部分定义是这样的:

    def build_model(self):
        layer_sizes = [n,n,n]
        kernel_sizes = [n,n,n]
        tf.reset_default_graph()
        self.graph = tf.Graph()
        with self.graph.as_default():
            self.num_classes = num_classes
            # placeholder for parameter
            self.learning_rate = tf.placeholder(tf.float32, name="learning_rate")
            self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
            self.phase = tf.placeholder(tf.bool, name="phase")
            # Placeholders for regular data
            self.input_x = tf.placeholder(tf.float32, [None, None, input_feature_dim], name="input_x")
            self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
            h = self.input_x
            ......[remaining codes]

我将其保存如下:

    def save_model(sess, output):
        saver = tf.train.Saver()
        save_path = saver.save(sess, os.path.join(output, 'model'))

当我加载它时,我使用:

    def load_model(self, sess, input_dir, logger):
        if logger is not None:
            logger.info("Start loading graph ...")
        saver = tf.train.import_meta_graph(os.path.join(input_dir, 'model.meta'))
        saver.restore(sess, os.path.join(input_dir, 'model'))
        self.graph = sess.graph
        self.input_x = self.graph.get_tensor_by_name("input_x:0")
        self.input_y = self.graph.get_tensor_by_name("input_y:0")
        self.num_classes = self.input_y.shape[1]
        self.learning_rate = self.graph.get_tensor_by_name("learning_rate:0")
        self.dropout_keep_prob = self.graph.get_tensor_by_name("dropout_keep_prob:0")
        self.phase = self.graph.get_tensor_by_name("phase:0")
        self.loss = self.graph.get_tensor_by_name("loss:0")
        self.optimizer = self.graph.get_operation_by_name("optimizer")
        self.accuracy = self.graph.get_tensor_by_name("accuracy/accuracy:0")

我使用 avg_checkpoint 来平均两个子模型:

python utils/avg_checkpoints.py
  --checkpoints path/to/checkpoint1,path/to/checkpoint2
  --num_last_checkpoints 2
  --output_path where/to/save/the/output

但是当我进一步检查 avg_checkpoints 代码时发现问题:

    for checkpoint in checkpoints:
        reader = tf.train.load_checkpoint(checkpoint)
        for name in var_values:
            tensor = reader.get_tensor(name)
            var_dtypes[name] = tensor.dtype
            var_values[name] += tensor
        tf.logging.info("Read from checkpoint %s", checkpoint)
    for name in var_values:  # Average.
        var_values[name] /= len(checkpoints)

    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        tf_vars = [
            tf.get_variable(v, shape=var_values[v].shape, dtype=var_dtypes[v])
            for v in var_values
        ]
    placeholders = [tf.placeholder(v.dtype, shape=v.shape) for v in tf_vars]
    assign_ops = [tf.assign(v, p) for (v, p) in zip(tf_vars, placeholders)]
    global_step = tf.Variable(
        0, name="global_step", trainable=False, dtype=tf.int64)
    saver = tf.train.Saver(tf.all_variables())

    # Build a model consisting only of variables, set them to the average values.
    with tf.Session() as sess:
        for p, assign_op, (name, value) in zip(placeholders, assign_ops,
                                               six.iteritems(var_values)):
            sess.run(assign_op, {p: value})
        # Use the built saver to save the averaged checkpoint.
        saver.save(sess, FLAGS.output_path, global_step=global_step)

它只保存变量,而不是所有张量。EG 当我使用上面的 load_model 函数加载它时,它不能有“input_x:0”张量。这个脚本是错误的,还是我应该根据我的使用来修改它?

我正在使用 TF r1.13。谢谢!

4

0 回答 0