Multi-layer neural network back-propagation formula (using stochastic gradient descent)

Using the notations from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:



import numpy as np          # assumed import

def sigmoid(z):             # assumed: the standard logistic sigmoid used in the forward pass
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return z * (1 - z)      # here z is already an activation σ(x), and σ'(x) = σ(x) (1 - σ(x))

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward pass
    A = [a]
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)

    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1) <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)


It works, but:





  • the accuracy at the end (for my use case: MNIST digit recognition) is just OK, but not very good.
    It is much better (i.e. the convergence is much better) when line (1) is replaced by:



    delta = np.dot(self.weights[k].T, delta)  # (2)



  • the code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also suggests:



    delta = np.dot(self.weights[k].T, delta)


    instead of:



    delta = np.dot(self.weights[k].T, tmp)


    (With the notations of this article, it is:



    output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)


    )




These two arguments seem concordant: code (2) is better than code (1).



However, the math seems to show the contrary (see the video here; one more detail: my loss function is multiplied by 1/2, whereas it is not in the video):



[image: hand-written derivation of the partial derivatives, transcribed in LaTeX below]



Question: which one is correct, implementation (1) or (2)?





In LaTeX:



$$\frac{\partial C}{\partial w^{L-1}} = \frac{\partial z^{L-1}}{\partial w^{L-1}} \frac{\partial a^{L-1}}{\partial z^{L-1}} \frac{\partial C}{\partial a^{L-1}} = a^{L-2} \, \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L - y)$$
$$\frac{\partial C}{\partial w^L} = \frac{\partial z^L}{\partial w^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C}{\partial a^L} = a^{L-1} \, \sigma'(z^L)(a^L - y)$$
$$\frac{\partial C}{\partial a^{L-1}} = \frac{\partial z^L}{\partial a^{L-1}} \frac{\partial a^L}{\partial z^L} \frac{\partial C}{\partial a^L} = w^L \, \sigma'(z^L)(a^L - y)$$
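
For reference, these chain-rule equations translate directly into vectorized NumPy expressions. The toy snippet below (made-up shapes, random values, last layer only) shows the correspondence; its final line has exactly the structure of line (1):

# Toy illustration of the three equations above (made-up shapes and values,
# single sample, last layer only; "L" denotes the output layer).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a_prev = rng.random((4, 1))                # a^{L-1}
W_L = rng.random((3, 4))                   # w^L
y = rng.random((3, 1))                     # target

z_L = np.dot(W_L, a_prev)                  # z^L = w^L a^{L-1}
a_L = sigmoid(z_L)                         # a^L = sigma(z^L)

tmp = a_L * (1 - a_L) * (a_L - y)          # sigma'(z^L) * (a^L - y), element-wise
dC_dW_L = np.dot(tmp, a_prev.T)            # dC/dw^L     = [sigma'(z^L)(a^L - y)] (a^{L-1})^T
dC_da_prev = np.dot(W_L.T, tmp)            # dC/da^{L-1} = (w^L)^T [sigma'(z^L)(a^L - y)]
                                           #   -> the structure of line (1): np.dot(weights[k].T, tmp)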









Tags: python, machine-learning, neural-network, backpropagation, gradient-descent






asked Nov 13 '18 at 18:05 by Basj, last edited Nov 14 '18 at 21:02













  • What do you mean by "correct" exactly? Why do you think only one of them can be correct?

    – Goyo, Nov 13 '18 at 22:50

  • @Goyo: solution (1) seems to be coherent with the math (partial-derivative computations, stochastic gradient descent), but the results are not so good. Solution (2) gives far better results (for an identical neural network, a very standard one for digit recognition) AND is the one that I found in the implementation I linked (see URL). But it does not seem to be coherent with the partial derivatives... hence the question: why is (2) better? Is it a well-known trick? Or just a random experimental observation that "works", so we use it?

    – Basj, Nov 13 '18 at 23:06

  • Last time I checked, nobody really knew why some things worked and others didn't, or why they worked in some cases and not in others. It was all hit-and-miss. Intuition based on biology, physics or mathematics served as inspiration, but strictly following those intuitions wouldn't always yield the best results. Maybe the previous chapters explain why Bernd Klein chose that implementation. If not, you might want to ask him.

    – Goyo, Nov 13 '18 at 23:56

  • Yes, maybe... but do you confirm, @Goyo, that the math is more in favour of (1), or did I make a mistake?

    – Basj, Nov 14 '18 at 0:07

  • I am afraid I am not qualified to answer that question. But Bernd Klein's website has an explanation of the back-propagation algorithm. Surprisingly, you do not mention it in the question. Do you think it is consistent with his implementation? With the explanations in the video? Why?

    – Goyo, Nov 14 '18 at 18:28
1 Answer














I spent two days analyzing this problem and filled a few pages of a notebook with partial-derivative computations... and I can confirm:




  • the maths written in LaTeX in the question are correct


  • the code (1) is the correct one, and it agrees with the math computations:



    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)



  • code (2) is wrong:



    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)


    and there is a slight mistake in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set:



    output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)


    should be



    output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))



Now the difficult part that took me days to realize:




  • Apparently, code (2) converges far better than code (1); that's why I was misled into thinking code (2) was correct and code (1) was wrong.


  • ... But in fact that's just a coincidence, because the learning_rate was set too low. Here is the reason: with code (2), delta grows much faster than with code (1) (printing np.linalg.norm(delta) helps to see this; a small sketch of this effect follows the list).


  • Thus the "incorrect" code (2) just compensated for the too-low learning rate with a bigger delta, and this led, in some cases, to apparently faster convergence.



Now solved!
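
For anyone who wants to verify this numerically, here is a standalone finite-difference check (a toy sketch with made-up layer sizes and random data, using plain functions instead of the class from the question). It computes the gradients with the recurrence of code (1) and compares them to central differences of the cost C = 1/2 ||a^L - y||^2; the two should agree to about 1e-8, whereas the hidden-layer gradients produced by the recurrence of code (2) generally do not.

# Standalone gradient check (made-up sizes and data; illustration only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2]                                     # input, two hidden layers, output
weights = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(3)]
x = rng.normal(size=(5, 1))
y = rng.normal(size=(2, 1))

def forward(ws):
    a, A = x, [x]
    for W in ws:
        a = sigmoid(np.dot(W, a))
        A.append(a)
    return A

def cost(ws):
    return 0.5 * np.sum((forward(ws)[-1] - y) ** 2)

# Analytic gradients with the recurrence of code (1).
A = forward(weights)
grads = [None] * 3
delta = A[-1] - y
for k in [2, 1, 0]:
    tmp = delta * A[k + 1] * (1 - A[k + 1])              # sigmoid_prime applied to the activation
    grads[k] = np.dot(tmp, A[k].T)
    delta = np.dot(weights[k].T, tmp)                    # line (1)

# Numerical gradients by central differences.
eps = 1e-6
for k in range(3):
    num = np.zeros_like(weights[k])
    for i in range(num.shape[0]):
        for j in range(num.shape[1]):
            wp = [W.copy() for W in weights]; wp[k][i, j] += eps
            wm = [W.copy() for W in weights]; wm[k][i, j] -= eps
            num[i, j] = (cost(wp) - cost(wm)) / (2 * eps)
    print(k, np.max(np.abs(num - grads[k])))             # should be ~1e-8 or smaller for every k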






answered Nov 15 '18 at 16:09 by Basj