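Hi, I'm having trouble checking the gradient of a neural network that I am implementing in Python with numpy. I'm using the MNIST dataset and trying mini-batch gradient descent.
I've checked the math and it looks fine on paper, so maybe you can give me a hint about what is happening here:
EDIT: One answer made me realize that the cost function was indeed being computed incorrectly. However, that does not explain the problem with the gradient, since the gradient is computed with back_prop. Using mini-batch gradient descent with rmsprop (learning_rate = 0.001, small because of rmsprop), 30 epochs and 100 batches, I get a 7% error rate with 300 units in the hidden layer.
Each input has 768 features, so for 100 samples I have a matrix. MNIST has 10 classes.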
X = NoSamplesxFeatures = 100x768
Y = NoSamplesxClasses =  100x10
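To make the shapes concrete, here is a minimal sketch of how a batch and its 1-of-C targets could be built. The helper `one_hot` is a hypothetical stand-in for `make_1_of_c_encoding` (whose real body is not shown in this post), and the data is random rather than actual MNIST.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Hypothetical stand-in for make_1_of_c_encoding: (N,) integer labels -> (N, num_classes)."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    # put a 1 in the column given by each label, leave everything else at 0
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

rng = np.random.default_rng(0)
X = rng.random((100, 768))               # NoSamples x Features (random placeholder data)
labels = rng.integers(0, 10, size=100)   # one digit per sample
Y = one_hot(labels)                      # NoSamples x Classes
print(X.shape, Y.shape)
```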
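When training fully, I use a neural network with a single hidden layer of 300 units. Another question I have is whether I should be using a softmax output function to compute the error... I think not. But I'm new to all of this, and things that are obvious may still look strange to me.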
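To frame the softmax question, here is a minimal sketch of what that alternative would look like. It assumes a softmax output paired with a cross-entropy cost, which is not the squared-error cost used in this post. Under that assumption the derivative of the cost with respect to the output pre-activations reduces to `probs - y`. The names `softmax`, `cross_entropy`, `z` and `grad_z` are illustrative only.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    """Mean cross-entropy over the batch for 1-of-C targets y."""
    eps = 1e-12  # guards against log(0)
    return -np.sum(y * np.log(probs + eps)) / y.shape[0]

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 10))                  # output pre-activations, N x Classes
y = np.eye(10)[rng.integers(0, 10, size=5)]   # 1-of-C targets, N x Classes
probs = softmax(z)

# gradient of the mean cross-entropy with respect to z (softmax output only)
grad_z = (probs - y) / y.shape[0]
print(cross_entropy(probs, y), grad_z.shape)
```

With sigmoid outputs and a squared-error cost, the output delta carries an extra sigmoid-derivative factor, so the two setups should not be mixed.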
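(NOTE: I know the code is ugly, but this is my first Python/Numpy code and it was written under pressure, so please bear with me.)
Here are back_prop and the activations: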
import numpy as np

def sigmoid(z):
    return np.true_divide(1, 1 + np.exp(-z))

# Not computed from z: this shortcut takes the activation a = sigmoid(z), which is faster.
def sigmoid_prime(a):
    return a * (1 - a)
  def _back_prop(self,W,X,labels,f=sigmoid,fprime=sigmoid_prime,lam=0.001):
    """
    Calculate the partial derivatives of the cost function using backpropagation.
    """     
    #Weight for first layer and hidden layer
    Wl1,bl1,Wl2,bl2  = self._extract_weights(W)
    # get the forward prop value
    layers_outputs = self._forward_prop(W,X,f)
    # turn each digit label into a binary vector: for MNIST, 1x10, all zeros except at the digit's index
    y = self.make_1_of_c_encoding(labels)
    num_samples = X.shape[0] # layers_outputs[-1].shape[0]
    # The dot product returns Numsamples (N) x Outputs (No Classes)
    # Y is N x No Classes
    # The last layer's output is too
    big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
    big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)
    # calculate the gradient for each training sample in the batch and accumulate it
    for i,x in enumerate(X):
        # Error with respect to the output
        dE_dy =  layers_outputs[-1][i,:] -  y[i,:] 
        # bias hidden layer
        big_delta_bl2 +=   dE_dy
        # get the error for the hidden layer
        dE_dz_out  = dE_dy * fprime(layers_outputs[-1][i,:])
        #and for the input layer
        dE_dhl = dE_dy.dot(Wl2.T)
        #bias input layer
        big_delta_bl1 += dE_dhl
        small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i,:])
        #here calculate the gradient for the weights in the hidden and first layer
        big_delta_wl2 += np.outer(layers_outputs[-2][i,:],dE_dz_out)
        big_delta_wl1 +=   np.outer(x,small_delta_hl)
    # divide by number of samples in the batch (should be done here)?
    big_delta_wl2 = np.true_divide(big_delta_wl2,num_samples) + lam*Wl2*2
    big_delta_bl2 = np.true_divide(big_delta_bl2,num_samples)
    big_delta_wl1 = np.true_divide(big_delta_wl1,num_samples) + lam*Wl1*2
    big_delta_bl1 = np.true_divide(big_delta_bl1,num_samples)
    # return everything flattened into one vector, in the order Wl1, bl1, Wl2, bl2
    return np.concatenate([big_delta_wl1.ravel(),
                           big_delta_bl1,
                           big_delta_wl2.ravel(),
                           big_delta_bl2.reshape(big_delta_bl2.size)])
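For comparison, here is a vectorized sketch of how the chain rule could be written for this architecture. It assumes a cost of 0.5 * squared error averaged over the batch plus `lam * sum(W**2)` on the two weight matrices only (the penalty whose derivative is `lam*W*2`), with sigmoid activations in both layers. Under those assumptions the output delta includes the sigmoid derivative, and that same delta feeds both the output bias and the hidden layer. The names `squared_error_cost` and `backprop_squared_error` are hypothetical and not part of the class above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error_cost(W1, b1, W2, b2, X, Y, lam=0.001):
    """Assumed cost: batch mean of 0.5*||a3 - y||^2 plus lam*sum(W^2) on weights only."""
    a2 = sigmoid(X @ W1 + b1)
    a3 = sigmoid(a2 @ W2 + b2)
    data_term = 0.5 * np.sum((a3 - Y) ** 2) / X.shape[0]
    return data_term + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

def backprop_squared_error(W1, b1, W2, b2, X, Y, lam=0.001):
    """Gradients of squared_error_cost with respect to W1, b1, W2, b2."""
    n = X.shape[0]
    a2 = sigmoid(X @ W1 + b1)                   # N x HLS
    a3 = sigmoid(a2 @ W2 + b2)                  # N x Outputs
    # output delta: dE/da3 times the sigmoid derivative at the output
    delta3 = (a3 - Y) * a3 * (1.0 - a3)         # N x Outputs
    # hidden delta: output delta pushed back through W2, times the hidden sigmoid derivative
    delta2 = (delta3 @ W2.T) * a2 * (1.0 - a2)  # N x HLS
    grad_W2 = a2.T @ delta3 / n + 2.0 * lam * W2
    grad_b2 = delta3.sum(axis=0) / n
    grad_W1 = X.T @ delta2 / n + 2.0 * lam * W1
    grad_b1 = delta2.sum(axis=0) / n
    return grad_W1, grad_b1, grad_W2, grad_b2

# Random data with the shapes used in this post, only to show the resulting gradient shapes
rng = np.random.default_rng(0)
X = rng.random((100, 768))
Y = np.eye(10)[rng.integers(0, 10, size=100)]
W1 = rng.normal(scale=0.01, size=(768, 300))
b1 = np.zeros(300)
W2 = rng.normal(scale=0.01, size=(300, 10))
b2 = np.zeros(10)
grads = backprop_squared_error(W1, b1, W2, b2, X, Y)
print([g.shape for g in grads])
```

The matrix products replace the per-sample loop and `np.outer` calls; summing the deltas over the batch axis plays the role of the `+=` accumulation.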
Now the feed-forward:
def _forward_prop(self,W,X,transfer_func=sigmoid):
    """
    Return a list with the output of every layer of the neural net: [A_2, A_3].
    A_2 is Numsamples (N) x hidden layer size (HLS); A_3 is N x Outputs (No Classes).
    """
    # Hidden layer DxHLS
    weights_L1,bias_L1,weights_L2,bias_L2 = self._extract_weights(W)    
    # Output layer HLSxOUT
    # A_2 = N x HLS
    A_2 = transfer_func(np.dot(X,weights_L1) + bias_L1 )
    # A_3 = N x  Outputs
    A_3 = transfer_func(np.dot(A_2,weights_L2) + bias_L2)
    # output layer
    return [A_2,A_3]
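Since the forward pass returns activations rather than pre-activations, `sigmoid_prime` is written in terms of `a = sigmoid(z)`. A quick sanity check, assuming only the two activation functions from this post, that `a*(1 - a)` agrees with a central-difference estimate of the derivative with respect to `z`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(a):
    # expects the activation a = sigmoid(z), not z itself
    return a * (1.0 - a)

z = np.linspace(-5.0, 5.0, 11)
a = sigmoid(z)

# central difference of sigmoid with respect to z
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

max_abs_diff = np.max(np.abs(numeric - sigmoid_prime(a)))
print(max_abs_diff)
```

Passing `z` instead of `a` to `sigmoid_prime` would silently give a different number, so it is worth keeping track of which one each layer output holds.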
The cost function for the gradient check:
def cost_function(self,W,X,labels,reg=0.001):
    """
    reg: regularization term
    No weight decay term - let's leave it for later
    """
    outputs = self._forward_prop(W,X,sigmoid)[-1] # take the last layer's output
    sample_size = X.shape[0]
    y = self.make_1_of_c_encoding(labels)
    # half the squared error for each sample
    e1 = np.sum((outputs - y)**2, axis=1)*0.5
    #error = e1.sum(axis=1)
    # average over the batch, plus an L2 penalty over all of W
    error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()
    return error
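For reference, the check itself can be sketched as a central-difference loop over the flat parameter vector. This is a generic sketch, not the exact check I ran: `numerical_gradient` and `relative_error` are hypothetical helper names, and the toy cost stands in for the network so the block runs on its own.

```python
import numpy as np

def numerical_gradient(cost, w, eps=1e-5):
    """Central-difference estimate of d cost / d w for a flat float vector w."""
    grad = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        step = np.zeros_like(w, dtype=float)
        step.flat[i] = eps
        grad.flat[i] = (cost(w + step) - cost(w - step)) / (2.0 * eps)
    return grad

def relative_error(a, b):
    """Norm of the difference, scaled by the sum of the norms."""
    return np.linalg.norm(a - b) / max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)

# Toy cost with a known gradient, standing in for cost_function / _back_prop
def toy_cost(w):
    return 0.5 * np.sum(w ** 2) + np.sum(np.sin(w))

def toy_grad(w):
    return w + np.cos(w)

w0 = np.random.default_rng(0).normal(size=20)
err = relative_error(numerical_gradient(toy_cost, w0), toy_grad(w0))
print(err)
```

With the class from this post, the same comparison would be `numerical_gradient(lambda w: net.cost_function(w, X, labels), W)` against `net._back_prop(W, X, labels)`, where `net` is a hypothetical instance name. The check can only pass if both functions use the same cost, including the same regularization term on the same parameters. Looping over every parameter is slow, so a small network and a small batch are enough.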