
I am fine-tuning a network, vgg16, using tf-slim. I want to manipulate the gradients manually so I can apply a different learning rate to the last layer. But when I try to use opt.minimize(), or tf.gradients() together with opt.apply_gradients(), I get None for the loss value in the summary report.
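For reference, this is roughly the manual variant I am trying (a minimal sketch, not my exact code; the variable list is a placeholder):

# compute the gradients explicitly with tf.gradients() ...
var_list = tf.trainable_variables()
grads = tf.gradients(total_loss, var_list)

# ... then apply them with opt.apply_gradients()
optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
train_op = optimizer.apply_gradients(list(zip(grads, var_list)),
                                     global_step=global_step)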

Why does this train_op code path work:

optimizer = tf.train.GradientDescentOptimizer( learning_rate=.001 )
train_op = slim.learning.create_train_op(total_loss, optimizer,
                                        global_step=global_step)

slim.learning.train(train_op, log_dir, 
                    init_fn=init_fn,
                    global_step=global_step,
                    number_of_steps=25,
                    save_summaries_secs=300,
                    save_interval_secs=600                       
                   )

But creating the train_op manually fails with the following exception (it appears that total_loss is None):

trainable = tf.trainable_variables()
optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
train_op = optimizer.minimize( total_loss, global_step=global_step )


# exception: appears that loss is None
--- Logging error ---
Traceback (most recent call last):
...
  File "/anaconda/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 755, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/anaconda/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 506, in train_step
    np_global_step, total_loss, time_elapsed)
  File "/anaconda/anaconda3/lib/python3.6/logging/__init__.py", line 338, in getMessage
    msg = msg % self.args
TypeError: must be real number, not NoneType
...
Message: 'global step %d: loss = %.4f (%.3f sec/step)'
Arguments: (29, None, 51.91366386413574)

What am I doing wrong here?


2 Answers


My use case was to apply a different learning_rate to the last, fine-tuned layer of the model, and this seems to mean I have to use a second optimizer.

On the assumption that sticking with the framework will pay off later, this is what I had to do to piece together an equivalent of tf.slim.create_train_op() that accepts matched lists of optimizers and grads_and_vars:

    def slim_learning_create_train_op_with_manual_grads( total_loss, optimizers, grads_and_vars,
                global_step=0,                                                            
              #  update_ops=None,
              #  variables_to_train=None,
                clip_gradient_norm=0,
                summarize_gradients=False,
                gate_gradients=1,               # tf.python.training.optimizer.Optimizer.GATE_OP,
                aggregation_method=None,
                colocate_gradients_with_ops=False,
                gradient_multipliers=None,
                check_numerics=True):

        """Runs the training loop
                modified from slim.learning.create_train_op() to work with
                a matched list of optimizers and grads_and_vars

        Returns:
            train_ops - the value of the loss function after training.
        """
        from tensorflow.python.framework import ops
        from tensorflow.python.ops import array_ops
        from tensorflow.python.ops import control_flow_ops
        from tensorflow.python.training import training_util

        def transform_grads_fn(grads):
            if gradient_multipliers:
                with ops.name_scope('multiply_grads'):
                    grads = slim.learning.multiply_gradients(grads, gradient_multipliers)

            # Clip gradients.
            if clip_gradient_norm > 0:
                with ops.name_scope('clip_grads'):
                    grads = slim.learning.clip_gradient_norms(grads, clip_gradient_norm)
            return grads

        if global_step is None:
            global_step = training_util.get_or_create_global_step()

        assert len(optimizers)==len(grads_and_vars)

        ### order of processing:
        # 0. grads = opt.compute_gradients() 
        # 1. grads = transform_grads_fn(grads)
        # 2. add_gradients_summaries(grads)
        # 3. grads = opt.apply_gradients(grads, global_step=global_step) 

        grad_updates = []
        for i in range(len(optimizers)):
            grads = grads_and_vars[i]                               # 0. passed in, from opt.compute_gradients()
            grads = transform_grads_fn(grads)                       # 1. transform_grads_fn()
            if summarize_gradients:
                with ops.name_scope('summarize_grads'):
                    slim.learning.add_gradients_summaries(grads)    # 2. add_gradients_summaries()
            if i==0:
                grad_update = optimizers[i].apply_gradients( grads, # 3. optimizer.apply_gradients()
                            global_step=global_step)                #    update global_step only once
            else:
                grad_update = optimizers[i].apply_gradients( grads )
            grad_updates.append(grad_update)

        with ops.name_scope('train_op'):
            total_loss = array_ops.check_numerics(total_loss,
                                            'LossTensor is inf or nan')
            train_op = control_flow_ops.with_dependencies(grad_updates, total_loss)

        # Add the operation used for training to the 'train_op' collection    
        train_ops = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
        if train_op not in train_ops:
            train_ops.append(train_op)

        return train_op
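For completeness, this is roughly how I wire it up for the two-learning-rate case (a sketch; picking the last layer by the name 'fc8' is only an illustration and depends on your variable names):

    # split the trainable variables into the pre-trained layers and the last layer
    all_vars = tf.trainable_variables()
    last_vars = [v for v in all_vars if 'fc8' in v.op.name]    # hypothetical layer name
    base_vars = [v for v in all_vars if v not in last_vars]

    # one optimizer (and learning rate) per variable group
    opt_base = tf.train.GradientDescentOptimizer(learning_rate=.0001)
    opt_last = tf.train.GradientDescentOptimizer(learning_rate=.001)

    # matched lists: grads_and_vars[i] comes from optimizers[i].compute_gradients()
    grads_and_vars = [opt_base.compute_gradients(total_loss, var_list=base_vars),
                      opt_last.compute_gradients(total_loss, var_list=last_vars)]

    train_op = slim_learning_create_train_op_with_manual_grads(
                    total_loss, [opt_base, opt_last], grads_and_vars,
                    global_step=global_step)

    slim.learning.train(train_op, log_dir, init_fn=init_fn,
                        global_step=global_step, number_of_steps=25)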
Answered 2018-01-16T09:38:42.133

The problem is that, despite its name, slim's create_train_op() creates something with a different return type than the usual definition of a train_op, which is what you are using in the second case, with the "non-slim" call:

optimizer.minimize( total_loss, global_step=global_step )

Try this:

optimizer = tf.train.GradientDescentOptimizer( learning_rate=.001 )
train_op_no_slim = optimizer.minimize(total_loss)
train_op = slim.learning.create_train_op(total_loss, optimizer)
print(train_op_no_slim)
print(train_op) 

For the first print statement, I get the "usual" (TensorFlow) training op:

name: "GradientDescent_2"
op: "NoOp"
input: "^GradientDescent_2/update_layer1/weight1/ApplyGradientDescent"
input: "^GradientDescent_2/update_layer1/bias1/ApplyGradientDescent"
input: "^GradientDescent_2/update_layer2/weight2/ApplyGradientDescent"
input: "^GradientDescent_2/update_layer2/bias2/ApplyGradientDescent"
input: "^GradientDescent_2/update_layer3/weight3/ApplyGradientDescent"
input: "^GradientDescent_2/update_layer3/bias3/ApplyGradientDescent"

For the second print statement, I get:

Tensor("train_op_1/control_dependency:0", shape=(), dtype=float32)

In short, slim.learning.create_train_op has a different return type than optimizer.minimize(): slim's version returns the total loss tensor (with the gradient update attached as a control dependency), while minimize() returns the update op itself, which carries no loss value for slim's logging to report.

To get around this: your use of a directly defined train_op takes you outside standard slim territory. My suggestion is to accept that and operate on the directly defined train_op in the non-slim way, using sess.run() or train_op.run() as in the typical (non-slim) TensorFlow examples.
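For example, a minimal non-slim training loop along those lines (a sketch; the initialization and step count are placeholders):

optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
train_op = optimizer.minimize(total_loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(25):
        # run the update op and fetch the loss explicitly,
        # instead of relying on slim.learning.train() to log it
        _, loss_value = sess.run([train_op, total_loss])
        print('step %d: loss = %.4f' % (step, loss_value))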

Answered 2018-01-16T02:43:50.533