After re-reading the relevant papers and looking up in a textbook I think the documentation of encog is not correct at this point. Why don't you just try it out by temporarily adding the minus-signs in the source code? If you use the same initial weights, you should receive exact the same results, given the documentation was correct. But in the end it just matters how you use the weightUpdate variable. If the author of the documentation is used to subtracting the weightUpdate from the weights instead of adding it, this will work.
Edit: I revisited the part about the gradient calculation in my original answer.
First, here is a brief explanation on how you can imagine the gradient for the weights in your output layer. First, you calculate the error between your outputs and the target values.
What you are now trying to do is to "blame" those neurons in the previous layer, which were active. Imagine the output neuron saying "Well, I have an error here, who is responsible?". Responsible are the neurons of the previous layer. Depending on the output being too small or too large compared to the target value, it will increase or decrease the weights to each of the neurons in the previous layers depending on how active they have been.
x is the activation of a neuron in the hidden layer.
o is the activation of the output neuron.
φ is the activation function of the output neuron, φ' its derivative.
Edit2: Corrected the part below. Added matrix style computation of backpropagation.
The error at each output neuron j is:
(1) δout, j = φ'(oj)(t - oj)
The gradient for the weight connecting the hidden neuron i with the output neuron j:
(2) gradi, j = xi * δout, j
The backpropagated error at each hidden neuron i with the weights w:
(3) δhid, i = φ'(x)*∑wi, j * δout, j
By repeatedly applying formula 2 and 3, you can backpropagate up to the input layer.
Written in loops, regarding one training sample:
The error at each output neuron j is:
for(int j=0; j < numOutNeurons; j++) {
errorOut[j] = activationDerivative(o[j])*(t[j] - o[j]);
}
The gradient for the weight connecting the hidden neuron i with the output neuron j:
for(int i=0; i < numHidNeurons; i++) {
for(int j=0; j < numOutNeurons; j++) {
grad[i][j] = x[i] * errorOut[j]
}
}
The backpropagated error at each hidden neuron i:
for(int i=0; i < numHidNeurons; i++) {
for(int j=0; j < numOutNeurons; j++) {
errorHid[i] = activationDerivative(x[i]) * weights[i][j] * errorOut[j]
}
}
In fully connected Multilayer Perceptrons without convolution or anything like that you can can use standard matrix operations, which is a lot faster.
Assuming each of your samples is a row in your input matrix and the columns are its attributes, you can propagate the input through your network like this:
activations[0] = input;
for(int i=0; i < numWeightMatrices; i++){
activations[i+1] = activations[i].dot(weightMatrices[i]);
activations[i+1] = activationFunction(activations[i+1]);
}
Backpropagation then becomes:
n = numWeightMatrices;
error = activationDerivative(activations[n]) * (target - activations[n]);
for (int l=n-1; l >= 0; l--){
gradient[l] = activations[l].transposed().dot(error);
if (l > 0) {
error = error.dot(weightMatrices[l].transposed());
error = activationDerivative(activations[l])*error;
}
}
I omitted the bias neuron in the above explanations. In literature it is recommended to model the bias neuron as an additional column in each activation matrix which is alway 1.0 . You will need to deal with some slice assigns. When using the matrix backpropagation loop, do not forget to set the error at the position of the bias to 0 before each step!